Name: Swin Base 4 12 384
Brand: Microsoft

Swin Base 4 12 384 is Microsoft's image to text model: a Swin Transformer-based image classifier that uses shifted window attention. It pairs a base-scale architecture with high-resolution 384×384 input; the model ID indicates a patch size of 4 and a window size of 12.
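To make the window configuration concrete, the following minimal sketch works out how a 384×384 input would be divided into attention windows. The patch size of 4 and window size of 12 are read from the model ID; the four-stage layout with 2× patch merging between stages is the standard Swin Transformer design and is an assumption here, since this page does not state it. The function name `swin_window_layout` is hypothetical.

```python
def swin_window_layout(image_size=384, patch_size=4, window_size=12, num_stages=4):
    """Return (tokens_per_side, windows_per_side, num_windows) for each stage.

    Assumes the standard Swin layout: a patch embedding followed by
    stages separated by 2x patch merging (not stated on this page).
    """
    layout = []
    # Patch embedding turns the image into a grid of patch tokens.
    tokens_per_side = image_size // patch_size
    for _ in range(num_stages):
        # Attention is computed inside non-overlapping square windows.
        windows_per_side = tokens_per_side // window_size
        layout.append((tokens_per_side, windows_per_side, windows_per_side ** 2))
        # Patch merging halves the token grid before the next stage.
        tokens_per_side //= 2
    return layout


for stage, (tokens, per_side, total) in enumerate(swin_window_layout(), start=1):
    print(f"stage {stage}: {tokens}x{tokens} tokens, "
          f"{per_side}x{per_side} windows ({total} total)")
```

Under these assumptions the first stage holds a 96×96 token grid split into 64 windows of 12×12 tokens, and the last stage fits in a single window. Calling `swin_window_layout(224, 4, 7)` gives the corresponding layout for the 224 variant listed under Model IDs.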

Specifications
Canonical ID	`microsoft-swin`
Type	Image to Text
Status	Active
Creator	Microsoft
Input Modalities	Image
Output Modalities	Text

Capabilities

Input (1/5)
Text	·
Image	✓
Audio	·
Video	·
PDF	·

Output (1/5)
Text	✓
Image	·
Audio	·
Video	·
Embedding	·

Capabilities (0/13)
Reasoning	·
Adaptive Reasoning	·
Function Calling	·
Parallel Function Calling	·
Structured Outputs	·
Native JSON Schema	·
Web Search	·
URL Context	·
Computer Use	·
Code Execution	·
File Search	·
Prompt Caching	·
Assistant Prefill	·

Versions

Version	Released	Context	Input / 1M	Output / 1M	Status
Swin Base 4 12 384	—	—	—	—	Current
Swin Large	—	—	—	—	Available
Swin S3 Base	—	—	—	—	Available
Swin S3 Small	—	—	—	—	Available
Swin S3 Tiny	—	—	—	—	Available
Swin Small	—	—	—	—	Available
Swin Tiny	—	—	—	—	Available

Model IDs

amazon_sagemaker/tensorflow-ic-swin-base-patch4-window12-384
amazon_sagemaker/tensorflow-ic-swin-base-patch4-window7-224
microsoft-swin
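
The SageMaker-style IDs above appear to encode the variant's size, patch size, window size, and input resolution. As a rough sketch, assuming that naming pattern holds, the fields can be pulled out with a regular expression. The helper name `parse_swin_id` and the pattern itself are illustrative, not part of any official API.

```python
import re

# Hypothetical pattern inferred from the IDs listed on this page.
ID_PATTERN = re.compile(
    r"swin-(?P<size>[a-z]+)-patch(?P<patch>\d+)"
    r"-window(?P<window>\d+)-(?P<resolution>\d+)$"
)


def parse_swin_id(model_id):
    """Extract variant fields from an ID, or return None if it does not match."""
    match = ID_PATTERN.search(model_id)
    if match is None:
        return None
    return {
        "size": match.group("size"),
        "patch": int(match.group("patch")),
        "window": int(match.group("window")),
        "resolution": int(match.group("resolution")),
    }


for model_id in (
    "amazon_sagemaker/tensorflow-ic-swin-base-patch4-window12-384",
    "amazon_sagemaker/tensorflow-ic-swin-base-patch4-window7-224",
    "microsoft-swin",
):
    print(model_id, "->", parse_swin_id(model_id))
```

The canonical ID `microsoft-swin` carries no variant fields, so the helper returns `None` for it rather than guessing.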