Llama 3.2 11B Vision Instruct is Meta's 11B instruction-tuned vision-language model, optimized for visual recognition, image reasoning, and captioning. It accepts text and image input, has a 131K-token context window, and produces up to 16K output tokens. It is available from 7 providers, starting at $0.015 / 1M input tokens and $0.025 / 1M output tokens.
Specifications
Canonical ID: meta-llama-3-2-11b-vision-instruct
Type: Language
Status: Active
Creator: Meta
Providers
Context Window: 131K tokens
Max Output: 16K tokens
Input Modalities: Image, Text
Output Modalities: Text
Parameters: 11B
Release Date: 2 years ago
Knowledge Cutoff

Capabilities

Input (2/5)
Supported: Text, Image
Not supported: Audio, Video, PDF
Output (1/5)
Supported: Text
Not supported: Image, Audio, Video, Embedding
Capabilities (3/13)
Supported: Function Calling, Parallel Function Calling, Structured Outputs
Not supported: Reasoning, Adaptive Reasoning, Native JSON Schema, Web Search, URL Context, Computer Use, Code Execution, File Search, Prompt Caching, Assistant Prefill

Pricing by Provider

Versions

Version                          Released  Context  Input / 1M  Output / 1M  Status
Llama 3.2 11B Vision Instruct    -         131K     $0.015      $0.025       Current
Llama 3.2 90B Vision Instruct    -         128K     $0.900      $0.900       Available
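
The per-million-token prices in the Versions table above can be turned into a rough per-request estimate. The following is a minimal sketch, not any provider's billing logic: the `PRICES` mapping and the `estimate_cost` function are hypothetical names introduced for illustration, the prices are the "starting at" figures listed above (individual providers may charge more), and the caller is assumed to supply the token counts already.

```python
from decimal import Decimal

# USD per 1M tokens as (input, output), copied from the Versions table.
# These are the listed starting prices; a given provider may differ.
PRICES = {
    "Llama 3.2 11B Vision Instruct": (Decimal("0.015"), Decimal("0.025")),
    "Llama 3.2 90B Vision Instruct": (Decimal("0.900"), Decimal("0.900")),
}

TOKENS_PER_UNIT = Decimal(1_000_000)  # prices are quoted per 1M tokens


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated USD cost of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_price, output_price = PRICES[model]
    total = input_price * input_tokens + output_price * output_tokens
    return total / TOKENS_PER_UNIT


if __name__ == "__main__":
    for name in PRICES:
        print(name, estimate_cost(name, 100_000, 2_000))
```

For a request with 100,000 input tokens and 2,000 output tokens, this works out to (0.015 × 100,000 + 0.025 × 2,000) / 1,000,000 = $0.00155 on the 11B model and $0.0918 on the 90B model. `Decimal` is used instead of floats so that the small per-token amounts do not pick up binary rounding error.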

Model IDs