Multimodal Embedding is Google's embedding model with a 2K-token context window, priced from $0.80 per 1M input tokens. It encodes both text and images into a shared vector space for cross-modal retrieval and similarity tasks.
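Because text and images land in the same vector space, cross-modal retrieval reduces to a nearest-neighbor search over vectors. The following is a minimal sketch of that idea, not the model's API: the three-dimensional vectors, the file names, and the helper names `cosine_similarity` and `rank_by_similarity` are all hypothetical stand-ins for real embeddings and real application code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors; real embeddings would come from the model.
text_query = [0.9, 0.1, 0.0]  # stands in for the text "a photo of a cat"
image_vectors = {
    "cat.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.1, 0.9, 0.2],
    "tree.jpg": [0.0, 0.2, 0.9],
}

ranking = rank_by_similarity(text_query, image_vectors)
print(ranking[0][0])  # name of the image closest to the text query
```

Cosine similarity is used here because it ignores vector length and compares direction only, which is a common choice for embedding search; other distance measures would work with the same ranking structure.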
Specifications
Canonical ID: google-multimodal-embedding
Type: Embedding
Status: Active
Creator: Google
Providers: Google Vertex AI
Context Window: 2K tokens
Input Modalities: Text
Output Modalities: Embedding
Embedding Dimensions: 768

Capabilities

Input: 1 of 5 supported
Supported: Text
Not supported: Image, Audio, Video, PDF
Output: 1 of 5 supported
Supported: Embedding
Not supported: Text, Image, Audio, Video
Capabilities: 0 of 13 supported
Not supported: Reasoning, Adaptive Reasoning, Function Calling, Parallel Function Calling, Structured Outputs, Native JSON Schema, Web Search, URL Context, Computer Use, Code Execution, File Search, Prompt Caching, Assistant Prefill

Pricing by Provider

Prices in US dollars ($).
Provider | Model ID | Standard Input | Image In ($ / image) | Video In ($ / sec)
Google Vertex AI | multimodalembedding | $0.000200 / 1K chars | $0.000100 | $0.000500
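The per-character text rate and the $0.80 per 1M input tokens headline figure can be reconciled with simple arithmetic. The sketch below assumes roughly 4 characters per token; that ratio is not stated on this page and is only the value that makes the two figures agree. The function names are hypothetical, and the rates are copied from the Google Vertex AI pricing listed above.

```python
# Rates copied from the Google Vertex AI pricing entry.
TEXT_USD_PER_1K_CHARS = 0.0002
IMAGE_USD_PER_IMAGE = 0.0001
VIDEO_USD_PER_SECOND = 0.0005

# ASSUMPTION: about 4 characters per token. This page does not state the ratio.
ASSUMED_CHARS_PER_TOKEN = 4


def text_usd_per_million_tokens(chars_per_token=ASSUMED_CHARS_PER_TOKEN):
    """Convert the per-1K-character rate into a per-1M-token rate."""
    chars = 1_000_000 * chars_per_token
    return chars / 1_000 * TEXT_USD_PER_1K_CHARS


def estimate_cost(text_chars=0, images=0, video_seconds=0.0):
    """Rough cost in USD for a mixed batch of text, images, and video."""
    return (text_chars / 1_000 * TEXT_USD_PER_1K_CHARS
            + images * IMAGE_USD_PER_IMAGE
            + video_seconds * VIDEO_USD_PER_SECOND)


print(f"{text_usd_per_million_tokens():.2f}")  # 0.80 under the 4 chars/token assumption
print(f"{estimate_cost(text_chars=500_000, images=1_000, video_seconds=60):.2f}")
```

With a different characters-per-token ratio the per-token figure changes proportionally, so treat the converted number as an estimate rather than a quoted price.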

Versions

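Version | Released | Context | Input / 1M | Output / 1M | Status
Text Embedding 5 | - | 2K | $0.025 | - | Available
Voyage 4 | - | 32K | $0.060 | - | Available
Voyage 4 Large | - | 32K | $0.120 | - | Available
Voyage 4 Lite | - | 32K | $0.020 | - | Available
Embed 4 | - | 128K | $0.120 | $0.470 | Available
Embed 4 Img | - | - | $0.470 | - | Available
Embed 4 Txt | - | - | $0.120 | - | Available
Text Embedding 4 | - | 2K | $0.100 | - | Deprecated
Voyage 3.5 | - | 32K | $0.060 | - | Available
Voyage 3.5 Lite | - | 32K | $0.020 | - | Available
Multimodal Embedding | - | 2K | $0.800 | - | Current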

Model IDs

google-multimodal-embedding
multimodalembedding
multimodalembedding@001
publishers/google/models/multimodalembedding
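
Since the same model appears under several identifiers, client code may want to map any of them to the canonical ID. A minimal sketch of such a lookup follows; `resolve_model_id` is a hypothetical helper, and treating all four strings as exact, case-sensitive aliases of one model is an assumption based only on the list above.

```python
CANONICAL_ID = "google-multimodal-embedding"

# The identifiers listed under "Model IDs", treated as exact aliases.
ALIASES = {
    "google-multimodal-embedding",
    "multimodalembedding",
    "multimodalembedding@001",
    "publishers/google/models/multimodalembedding",
}


def resolve_model_id(model_id):
    """Return the canonical ID if model_id is a listed alias, else None."""
    return CANONICAL_ID if model_id.strip() in ALIASES else None


print(resolve_model_id("multimodalembedding@001"))
```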