Mosaic AI Foundation Model Serving
Serve state-of-the-art foundation models for both real-time and batch inference workload needs. This enables you to quickly and easily build applications that leverage high-quality generative AI models without the need to maintain your own model deployment.
* Displayed pricing does not guarantee product availability in that region. For product availability see here: AWS, Azure, GCP, SAP
1. Throughput in a single unit of PT capacity varies by model and query shape (input vs. output tokens). Please use the GenAI Calculator to estimate workload-specific throughput and total cost.
2. Hourly pricing is charged on a per-minute increment.
Foundation Model Serving DBU rates
Model | Pay-Per-Token | Provisioned Throughput | ||
---|---|---|---|---|
DBU / M input tokens | DBU / M output tokens | DBU / hour (entry capacity) | DBU / hour (scaling capacity) | |
Llama 4 Maverick | 7.143 | 21.429 | 85.714 | 85.714 |
Llama 3.3 70B | 7.143 | 21.429 | 85.714 | 342.857 |
GPT OSS 120B | 2.143 | 8.571 | 71.429 | 71.429 |
Gemma 3 12B | 2.143 | 7.143 | 71.429 | 71.429 |
Llama 3.1 8B | 2.143 | 6.429 | 53.571 | 106.000 |
GPT OSS 20B | 1.000 | 4.286 | 53.571 | 53.571 |
Llama 3.2 3B | n/a | n/a | 46.429 | 92.857 |
Llama 3.2 1B | n/a | n/a | 42.857 | 85.714 |
GTE | 1.857 | n/a | 20.000 | 20.000 |
BGE Large | 1.429 | n/a | 24.000 | 24.000 |
1: Entry capacity is the small, lower-cost PT capacity unit designed to provide a more affordable starting point for customers. These provide proportionally reduced throughput compared to the scaling capacity. These are only available in Azure and AWS for US, Canada and Brazil regions, and only for base (not fine-tuned) models.
2: Scaling capacity is the standard PT capacity increment that can be provisioned for a model. Beyond entry capacity (available in select clouds and regions), Provisioned Throughput capacity scales up and down in increments of these scaling capacity units. In clouds / regions where entry capacity is not available, the minimum PT purchase increment is the full scaling capacity unit.
Pay as you go with a 14-day free trial or contact us for committed-use discounts or custom requirements.