Name: Phi-4 Multimodal Instruct
Brand: Microsoft

Phi-4 Multimodal Instruct is Microsoft's language model with a 131K context window and up to 4K output tokens, starting at $0.080 / 1M input and $0.320 / 1M output. A multimodal instruction-tuned variant of Phi-4 capable of processing both text and visual inputs for a compact yet capable small language model.

Specifications
Canonical ID	`microsoft-phi-4-multimodal-instruct`
Type	Language
Status	Active
Creator	Microsoft
Providers	Microsoft Azure AI Foundry
Context Window	131K tokens
Max Output	4K tokens
Input Modalities	AudioImage
Output Modalities	Text

Capabilities

Input2/5

Text·

Image✓

Audio✓

Video·

PDF·

Output1/5

Text✓

Image·

Audio·

Video·

Embedding·

Capabilities1/13

Reasoning·

Adaptive Reasoning·

Function Calling✓

Parallel Function Calling·

Structured Outputs·

Native JSON Schema·

Web Search·

URL Context·

Computer Use·

Code Execution·

File Search·

Prompt Caching·

Assistant Prefill·

Pricing by Provider

Provider	Standard
Provider	Input $ / 1M	Output $ / 1M	Audio In $ / 1M
Azure AI Foundry azure_ai/Phi-4-multimodal-instruct	$0.080	$0.320	$4.00

Cost Calculator

Preset:

Input tokens

Output tokens

Number of calls

Versions

Version	Released	Context	Input / 1M	Output / 1M	Status
Phi-4 Mini Instruct	2025-10-17	131K	$0.075	$0.300	Available
Phi-4	2025-01-10	16K	$0.065	$0.140	Available
Phi-4 Multimodal Instruct	—	131K	$0.080	$0.320	Current
Phi-4 Multimodal	—	—	—	—	Available
Phi-4 Mini	—	—	—	—	Available
Phi-4 Eagle	—	—	—	—	Available
Phi-4 Mini MM	—	—	—	—	Available
Phi-4 Mini Reasoning	—	131K	$0.080	$0.320	Available
Phi-4 Reasoning	—	33K	$0.125	$0.500	Available
Phi-4 Reasoning Plus	—	—	—	—	Available
Phi-3 Mini	—	—	—	—	Available

Phi-4 Multimodal Instruct

Capabilities

Pricing by Provider

Cost Calculator

Versions

Model IDs