Phi-3.5 Vision Instruct is Microsoft's instruction-tuned multimodal Phi-3.5 model, with vision capabilities for image understanding and visual question answering. It has a 128K-token context window and produces up to 4K output tokens, with pricing starting at $0.130 per 1M input tokens and $0.520 per 1M output tokens.
Specifications
Canonical ID: microsoft-phi-3-5-vision-instruct
Type: Language
Status: Active
Creator: Microsoft
Providers: Azure AI Foundry
Context Window: 128K tokens
Max Output: 4K tokens
Input Modalities: Image
Output Modalities: Text

Capabilities

Input (1 of 5 supported)
Supported: Image
Not supported: Text, Audio, Video, PDF
Output (1 of 5 supported)
Supported: Text
Not supported: Image, Audio, Video, Embedding
Capabilities (0 of 13 supported)
Not supported: Reasoning, Adaptive Reasoning, Function Calling, Parallel Function Calling, Structured Outputs, Native JSON Schema, Web Search, URL Context, Computer Use, Code Execution, File Search, Prompt Caching, Assistant Prefill

Pricing by Provider

Standard pricing, $ per 1M tokens

Provider          Model ID                          Input    Output
Azure AI Foundry  azure_ai/Phi-3.5-vision-instruct  $0.130   $0.520
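
As a rough illustration of how the per-1M-token rates listed for Azure AI Foundry translate into a per-request cost, the following minimal sketch multiplies token counts by the listed prices. The function name `estimate_cost` and the example token counts are hypothetical, chosen only for illustration; actual billing may differ by provider and tier.

```python
from decimal import Decimal

# Rates taken from the Azure AI Foundry row on this page (USD per 1M tokens).
INPUT_PRICE_PER_M = Decimal("0.130")
OUTPUT_PRICE_PER_M = Decimal("0.520")
TOKENS_PER_M = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Hypothetical helper: estimated USD cost of one request at the listed rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        Decimal(input_tokens) * INPUT_PRICE_PER_M
        + Decimal(output_tokens) * OUTPUT_PRICE_PER_M
    )
    return total / TOKENS_PER_M


if __name__ == "__main__":
    # Example workload (assumed): 100,000 input tokens and 4,000 output tokens.
    print(f"${estimate_cost(100_000, 4_000)}")
```

Decimal arithmetic is used here instead of floats so that small per-token amounts do not pick up binary rounding noise when summed over many requests.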

Versions

Version                    Released  Context  Input / 1M  Output / 1M  Status
Phi-4 Mini Instruct        -         131K     $0.075      $0.300       Available
Phi-4                      -         16K      $0.065      $0.140       Available
Phi-4 Multimodal           -         -        -           -            Available
Phi-4 Mini                 -         -        -           -            Available
Phi-4 Eagle                -         -        -           -            Available
Phi-4 Mini MM              -         -        -           -            Available
Phi-4 Mini Reasoning       -         131K     $0.080      $0.320       Available
Phi-4 Multimodal Instruct  -         131K     $0.080      $0.320       Available
Phi-4 Reasoning            -         33K      $0.125      $0.500       Available
Phi-4 Reasoning Plus       -         -        -           -            Available
Phi-3.5 Vision Instruct    -         128K     $0.130      $0.520       Current
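
To compare the versions that list prices, a small sketch like the one below ranks them by estimated cost for a given workload. The helper names `workload_cost` and `rank_by_cost` and the sample token counts are assumptions made for illustration; versions without listed prices are left out, and the versions differ in modality and context size, so price alone is not a selection criterion.

```python
from decimal import Decimal

# (input, output) USD per 1M tokens, copied from the versions listed on this page.
# Versions with no listed price are omitted.
PRICES = {
    "Phi-4 Mini Instruct": (Decimal("0.075"), Decimal("0.300")),
    "Phi-4": (Decimal("0.065"), Decimal("0.140")),
    "Phi-4 Mini Reasoning": (Decimal("0.080"), Decimal("0.320")),
    "Phi-4 Multimodal Instruct": (Decimal("0.080"), Decimal("0.320")),
    "Phi-4 Reasoning": (Decimal("0.125"), Decimal("0.500")),
    "Phi-3.5 Vision Instruct": (Decimal("0.130"), Decimal("0.520")),
}


def workload_cost(version: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Hypothetical helper: estimated USD cost of a workload for one version."""
    input_price, output_price = PRICES[version]
    total = Decimal(input_tokens) * input_price + Decimal(output_tokens) * output_price
    return total / Decimal(1_000_000)


def rank_by_cost(input_tokens: int, output_tokens: int) -> list[str]:
    """Return version names ordered from cheapest to most expensive."""
    return sorted(
        PRICES,
        key=lambda name: workload_cost(name, input_tokens, output_tokens),
    )


if __name__ == "__main__":
    # Example workload (assumed): 10,000 input tokens and 1,000 output tokens.
    for name in rank_by_cost(10_000, 1_000):
        print(name, workload_cost(name, 10_000, 1_000))
```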

Model IDs