Qwen2.5 VL 72B Instruct is a 72-billion-parameter multimodal vision-language model from Alibaba's Qwen2.5-VL series, built for high-capacity image understanding and visual reasoning. It has a 131K-token context window and up to 8K output tokens. It is available from 7 providers, starting at $0.130 per 1M input tokens and $0.400 per 1M output tokens.
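The starting prices translate into per-request cost with simple arithmetic. The sketch below assumes a provider that bills input and output tokens separately at exactly these starting rates, and that image inputs are already counted as tokens in `input_tokens`. Real provider prices and image-tokenization rules vary. `estimate_cost` is a hypothetical helper, not part of any provider API.

```python
# Starting prices listed for this model, in USD per 1M tokens.
# Assumption: the chosen provider charges exactly these rates.
INPUT_PRICE_PER_M = 0.130
OUTPUT_PRICE_PER_M = 0.400


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Hypothetical helper: estimated USD cost of one request at the starting prices.

    input_tokens is assumed to already include the tokens produced from any images.
    """
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Example: a 20,000-token prompt (text plus image tokens) and a 1,000-token answer.
print(f"${estimate_cost(20_000, 1_000):.6f}")
```

In this example, the input side costs $0.0026 and the output side costs $0.0004, for about $0.003 per request.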
Specifications
Canonical ID: alibaba-qwen2-5-vl-72b-instruct
Type: Language
Status: Active
Creator: Alibaba
Providers: 7
Context Window: 131K tokens
Max Output: 8K tokens
Input Modalities: Image, Text
Output Modalities: Text
Parameters: 72B
Hugging Face Likes: 609
Hugging Face Downloads (30d): 103,451
Hugging Face Downloads (all-time): 5,812,114
Release Date: 1 year ago
Knowledge Cutoff: not listed
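The context window and maximum output together limit how long an answer can be for a given prompt. The sketch below assumes that "131K" and "8K" mean 131,072 and 8,192 tokens and that the prompt and the generated output share the same context window. Both are assumptions, and individual providers may enforce lower limits. `max_output_for_prompt` is a hypothetical helper.

```python
# Assumption: 131K and 8K are read as 131,072 and 8,192 tokens.
CONTEXT_WINDOW = 131_072
MAX_OUTPUT = 8_192


def max_output_for_prompt(prompt_tokens: int) -> int:
    """Hypothetical helper: largest output budget that still fits.

    Assumes prompt tokens (including image tokens) and output tokens
    share the same context window.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    remaining = CONTEXT_WINDOW - prompt_tokens
    if remaining <= 0:
        raise ValueError(
            f"prompt of {prompt_tokens} tokens exceeds the {CONTEXT_WINDOW}-token context window"
        )
    # The output is capped by the model's maximum and by the space left in the window.
    return min(MAX_OUTPUT, remaining)


print(max_output_for_prompt(10_000))   # → 8192
print(max_output_for_prompt(128_000))  # → 3072
```

With short prompts, the 8K output cap is the binding limit. Once a prompt grows past about 123K tokens, the remaining context window becomes the tighter constraint.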

Capabilities

Input (2/5 supported)
Text: yes
Image: yes
Audio: no
Video: no
PDF: no
Output (1/5 supported)
Text: yes
Image: no
Audio: no
Video: no
Embedding: no
Capabilities (3/13 supported)
Reasoning: no
Adaptive Reasoning: no
Function Calling: yes
Parallel Function Calling: no
Structured Outputs: yes
Native JSON Schema: yes
Web Search: no
URL Context: no
Computer Use: no
Code Execution: no
File Search: no
Prompt Caching: no
Assistant Prefill: no

Versions

Version | Context | Input / 1M | Output / 1M | Status
Qwen2.5 VL 72B Instruct | 131K | $0.130 | $0.400 | Current
Qwen2.5 VL 32B Instruct | 131K | $0.200 | $0.600 | Available
Qwen2.5 VL 3B Instruct | 131K | $0.200 | $0.200 | Available
Qwen2.5 VL 7B Instruct | 131K | $0.200 | $0.200 | Available
Rolm OCR | 128K | $0.200 | $0.200 | Available
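To compare versions for a specific workload, apply each row's per-1M prices to the same token counts. The sketch below copies the prices from the Versions table. Keep in mind that the 72B figures are "starting at" prices across providers, while the other rows may reflect different providers, so the ranking is only as reliable as the listed numbers. `workload_cost` is a hypothetical helper.

```python
# USD per 1M tokens (input, output), copied from the Versions table.
VERSIONS = {
    "Qwen2.5 VL 72B Instruct": (0.130, 0.400),
    "Qwen2.5 VL 32B Instruct": (0.200, 0.600),
    "Qwen2.5 VL 3B Instruct": (0.200, 0.200),
    "Qwen2.5 VL 7B Instruct": (0.200, 0.200),
    "Rolm OCR": (0.200, 0.200),
}


def workload_cost(input_tokens: int, output_tokens: int) -> dict[str, float]:
    """Hypothetical helper: estimated USD cost per version, cheapest first."""
    costs = {
        name: (input_tokens * inp + output_tokens * out) / 1_000_000
        for name, (inp, out) in VERSIONS.items()
    }
    # sorted() is stable, so versions with equal cost keep their table order.
    return dict(sorted(costs.items(), key=lambda item: item[1]))


# Example: 5M input tokens and 1M output tokens.
for name, cost in workload_cost(5_000_000, 1_000_000).items():
    print(f"{name}: ${cost:.2f}")
```

For this input-heavy mix, the listed prices put the 72B row lowest (about $1.05), ahead of the 3B, 7B, and Rolm OCR rows (about $1.20 each) and the 32B row (about $1.60).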

Model IDs