PaddleOCR VL is an image-to-text model from PaddlePaddle (Baidu) with a 16K-token context window and up to 16K output tokens, priced from $0.020 per 1M input tokens and $0.020 per 1M output tokens. It is a vision-language model tailored for document parsing and optical character recognition across diverse document layouts.
Specifications
Canonical ID: paddlepaddle-paddleocr-vl
Type: Image to Text
Status: Active
Creator: PaddlePaddle (Baidu)
Providers: Novita
Context Window: 16K tokens
Max Output: 16K tokens
Input Modalities: Image
Output Modalities: Text

Capabilities

Input (1 of 5 supported)
Supported: Image
Not supported: Text, Audio, Video, PDF

Output (1 of 5 supported)
Supported: Text
Not supported: Image, Audio, Video, Embedding

Capabilities (0 of 13 supported)
Not supported: Reasoning, Adaptive Reasoning, Function Calling, Parallel Function Calling, Structured Outputs, Native JSON Schema, Web Search, URL Context, Computer Use, Code Execution, File Search, Prompt Caching, Assistant Prefill

Pricing by Provider

Provider | Model ID | Standard Input ($ / 1M tokens) | Standard Output ($ / 1M tokens)
Novita | novita/paddlepaddle/paddleocr-vl | $0.020 | $0.020
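
The per-token rates above translate into a request or batch cost with simple arithmetic. The following is a minimal sketch, not an official calculator: the function name `estimate_cost_usd` and the token counts in the usage note are hypothetical, and only the two $0.020 rates come from the listing.

```python
from decimal import Decimal

# Rates from the Novita row of the pricing listing, in USD per 1M tokens.
INPUT_USD_PER_M = Decimal("0.020")
OUTPUT_USD_PER_M = Decimal("0.020")
TOKENS_PER_M = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost in USD for the given token counts (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Multiply each count by its rate, then scale from per-1M to per-token.
    total = (
        Decimal(input_tokens) * INPUT_USD_PER_M
        + Decimal(output_tokens) * OUTPUT_USD_PER_M
    )
    return total / TOKENS_PER_M


if __name__ == "__main__":
    # Assumed workload: 1,000 pages at 1,500 input and 800 output tokens each.
    print(estimate_cost_usd(1_000 * 1_500, 1_000 * 800))
```

Under those assumed page sizes, a 1,000-page batch comes to roughly $0.046. Actual token counts per image depend on the provider's tokenization and are not stated in this listing.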

Versions

Version | Released | Context | Input / 1M | Output / 1M | Status
PaddleOCR VL | - | 16K | $0.020 | $0.020 | Current
PaddleOCR 0.9B VL | - | - | - | - | Available

Model IDs