Phi-4 Multimodal Instruct


Phi-4 Multimodal Instruct is Microsoft's language model with a 131K context window and up to 4K output tokens, starting at $0.080 / 1M input and $0.320 / 1M output. It is an instruction-tuned multimodal model in the Phi-4 series, combining vision and language capabilities for complex image-text reasoning and conversational tasks.
Spec
Canonical ID: microsoft-phi-4-multimodal-instruct
Type: Language
Status: Active
Creator: Microsoft
Providers
Context Window: 131K tokens
Max Output: 4K tokens
Input Modalities: Audio, Image
Output Modalities: Text

Capabilities

Input: 2/5
Supported: Image, Audio
Not supported: Text, Video, PDF

Output: 1/5
Supported: Text
Not supported: Image, Audio, Video, Embedding

Capabilities: 1/13
Supported: Function Calling
Not supported: Reasoning, Adaptive Reasoning, Parallel Function Calling, Structured Outputs, Native JSON Schema, Web Search, URL Context, Computer Use, Code Execution, File Search, Prompt Caching, Assistant Prefill

Pricing by Provider

Standard tier, USD per 1M tokens:

Provider          Model                      Input   Output  Audio In
Azure AI Foundry  Phi-4-multimodal-instruct  $0.080  $0.320  $4.00
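
As a rough illustration of how the per-million-token prices above turn into a request cost, the sketch below multiplies token counts by the listed Azure AI Foundry rates. The helper name `estimate_cost_usd` and the example token counts are hypothetical. How the provider counts audio tokens, and whether it rounds charges, is not stated on this page, so the result is only an estimate.

```python
# Listed Azure AI Foundry standard-tier prices, in USD per 1M tokens.
PRICE_PER_MILLION = {
    "input": 0.080,
    "output": 0.320,
    "audio_in": 4.00,
}


def estimate_cost_usd(input_tokens=0, output_tokens=0, audio_in_tokens=0):
    """Estimate the cost of one request (hypothetical helper, not a provider API)."""
    counts = {
        "input": input_tokens,
        "output": output_tokens,
        "audio_in": audio_in_tokens,
    }
    # Each price is quoted per 1,000,000 tokens, so scale the counts down.
    return sum(counts[k] * PRICE_PER_MILLION[k] / 1_000_000 for k in counts)


if __name__ == "__main__":
    # Example workload (assumed): 10,000 input tokens and 1,000 output tokens.
    print(f"${estimate_cost_usd(10_000, 1_000):.6f}")
```

Keeping the prices in a single dictionary makes it easy to swap in another provider or tier if one is listed later.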


Versions

Version                    Released  Context  Input / 1M  Output / 1M  Status
Phi-4                      -         16K      $0.065      $0.140       Available
Phi-4 Multimodal Instruct  -         131K     $0.080      $0.320       Current
Phi-4 Mini Instruct        -         131K     $0.075      $0.300       Available
Phi-4 Mini Reasoning       -         131K     $0.080      $0.320       Available
Phi-4 Reasoning            -         33K      $0.125      $0.500       Available
Phi-3.5 Mini Instruct      -         128K     $0.130      $0.520       Available
Phi-3.5 MoE Instruct       -         128K     $0.160      $0.640       Available
Phi-3.5 Vision Instruct    -         128K     $0.130      $0.520       Available
Phi-3                      -         -        -           -            Available
Phi-3 Medium Instruct      -         128K     $0.170      $0.680       Available
Phi-3 Mini Instruct        -         131K     $0.100      $0.100       Available
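
One possible way to use the version list is to filter out models whose context window is too small and rank the rest by cost for a given workload. The sketch below hard-codes the rows listed above; Phi-3 is left out because no context size or price is shown for it. The `Version` tuple, the `cheapest_for` helper, and any workload numbers are illustrative assumptions. Context sizes stay in the rounded "K" units used on this page, and the ranking ignores modality and capability differences between models.

```python
from typing import NamedTuple


class Version(NamedTuple):
    name: str
    context_k: int     # context window in thousands of tokens, as listed
    input_usd: float   # USD per 1M input tokens
    output_usd: float  # USD per 1M output tokens


# Rows copied from the version list; Phi-3 omitted (no context or price shown).
VERSIONS = [
    Version("Phi-4", 16, 0.065, 0.140),
    Version("Phi-4 Multimodal Instruct", 131, 0.080, 0.320),
    Version("Phi-4 Mini Instruct", 131, 0.075, 0.300),
    Version("Phi-4 Mini Reasoning", 131, 0.080, 0.320),
    Version("Phi-4 Reasoning", 33, 0.125, 0.500),
    Version("Phi-3.5 Mini Instruct", 128, 0.130, 0.520),
    Version("Phi-3.5 MoE Instruct", 128, 0.160, 0.640),
    Version("Phi-3.5 Vision Instruct", 128, 0.130, 0.520),
    Version("Phi-3 Medium Instruct", 128, 0.170, 0.680),
    Version("Phi-3 Mini Instruct", 131, 0.100, 0.100),
]


def cheapest_for(min_context_k, input_tokens, output_tokens):
    """Return (name, cost_usd) pairs that meet the context need, cheapest first."""
    ranked = []
    for v in VERSIONS:
        if v.context_k < min_context_k:
            continue  # context window too small for this workload
        cost = (input_tokens * v.input_usd + output_tokens * v.output_usd) / 1_000_000
        ranked.append((v.name, cost))
    return sorted(ranked, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Assumed workload: needs at least 100K context, 50,000 in / 2,000 out.
    for name, cost in cheapest_for(100, 50_000, 2_000):
        print(f"{name}: ${cost:.6f}")
```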

Model IDs