
Model Serving

Unified deployment and governance for all AI models



Databricks Model Serving is a unified service for deploying, governing, querying and monitoring models, whether fine-tuned or pre-deployed by Databricks (such as Meta Llama 3, DBRX or BGE) or hosted by another model provider (such as Azure OpenAI, AWS Bedrock, AWS SageMaker and Anthropic). This unified approach makes it easy to experiment with and productionize models from any cloud or provider to find the best candidate for your real-time application. Once models are deployed, you can A/B test them and monitor their quality on live production data. Model Serving also offers pre-deployed models such as Llama 2 70B, so you can jump-start AI applications like retrieval augmented generation (RAG), with pay-per-token access or pay-for-provisioned compute for throughput guarantees.
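As a sketch of what querying a served model looks like, the snippet below assembles a chat request against a workspace's `/serving-endpoints/<name>/invocations` route. The workspace URL, endpoint name and token are hypothetical placeholders, not values from this page; substitute your own before sending.

```python
import json
import urllib.request


def build_invocation_request(workspace_url, endpoint_name, token, messages, max_tokens=256):
    """Assemble (but do not send) an HTTP request for a Model Serving endpoint.

    The /serving-endpoints/<name>/invocations route accepts a JSON body;
    for chat models the body carries OpenAI-style "messages".
    """
    url = f"{workspace_url}/serving-endpoints/{endpoint_name}/invocations"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"messages": messages, "max_tokens": max_tokens}).encode()
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical workspace, endpoint and token -- replace with your own.
req = build_invocation_request(
    "https://my-workspace.cloud.databricks.com",
    "databricks-meta-llama-3-70b-instruct",
    "dapi-...",
    [{"role": "user", "content": "Summarize RAG in one sentence."}],
)
# response = urllib.request.urlopen(req)  # sends the request when uncommented
```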



Simplified deployment for all AI models

Deploy any model type on both CPU and GPU, from pretrained open source models to custom models built on your own data. Automated container builds and infrastructure management reduce maintenance costs and speed up deployment, so you can focus on building your AI projects and delivering value for your business faster.


Unified management for all models

Manage all models in one place: custom ML models such as PyFunc, scikit-learn and LangChain models; foundation models (FMs) on Databricks such as Llama 2, MPT and BGE; and foundation models hosted elsewhere such as ChatGPT, Claude 2, Cohere and Stable Diffusion. Model Serving makes every model accessible through a unified user interface and API, whether it is hosted by Databricks or by another provider on Azure or AWS.
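To make the custom-model (PyFunc) pattern concrete, here is a minimal dependency-free sketch of the interface such models expose. With MLflow installed you would subclass `mlflow.pyfunc.PythonModel` and log it with `mlflow.pyfunc.log_model` before serving; the toy class and its threshold logic below are illustrative assumptions, not a Databricks API.

```python
class TemperatureRuleModel:
    """Toy custom model: flags readings above a threshold.

    Mirrors the pyfunc convention of predict(context, model_input);
    under MLflow, `context` carries logged artifacts (unused here).
    """

    def __init__(self, threshold=30.0):
        self.threshold = threshold

    def predict(self, context, model_input):
        # Return one boolean per input reading.
        return [x > self.threshold for x in model_input]


model = TemperatureRuleModel(threshold=25.0)
print(model.predict(None, [20.0, 31.5]))  # → [False, True]
```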


Governance built-in

Meet stringent security and governance requirements: enforce proper permissions, monitor model quality, set rate limits, and track lineage across all models, whether they are hosted by Databricks or by any other model provider.

Unified with Lakehouse data

Data-centric models

Accelerate deployments and reduce errors through deep integration with the Data Intelligence Platform. Easily host generative AI models that are augmented (RAG) or fine-tuned with your enterprise data. Model Serving provides automated lookups, monitoring and governance across the entire AI lifecycle.
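To illustrate the RAG lookup step mentioned above, here is a dependency-free sketch: retrieve the documents most relevant to a question, then assemble them into a grounded prompt. The word-overlap scoring is a naive stand-in for a real vector search index, and all document text is made up for the example.

```python
def retrieve(question, documents, k=2):
    """Rank documents by naive word overlap with the question
    (a stand-in for a vector search lookup) and return the top k."""
    q_words = set(question.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]


def build_prompt(question, documents):
    """Assemble a grounded prompt: retrieved context first, then the question."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


docs = [
    "Model Serving exposes endpoints for real-time inference.",
    "The cafeteria menu changes every Tuesday.",
    "Endpoints scale from zero based on traffic.",
]
prompt = build_prompt("How do serving endpoints scale?", docs)
```

In a production setup the retrieval step would query a managed index (and the prompt would be sent to a serving endpoint), but the retrieve-then-assemble shape stays the same.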



Serve models as a low-latency API on a highly available serverless service with both CPU and GPU support. Effortlessly scale from zero to meet your most critical needs, and back down as requirements change. Get started quickly with one or more pre-deployed models, paying per token (on demand, with no commitments) or paying for provisioned compute for guaranteed throughput. Databricks takes care of infrastructure management and maintenance, so you can focus on delivering business value.
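The choice between pay-per-token and provisioned throughput is a back-of-envelope comparison of expected volume against a fixed hourly rate. The sketch below shows that arithmetic; the prices are invented placeholders, not Databricks list prices.

```python
def monthly_cost(tokens_per_month, per_token_price, provisioned_hourly_price, hours=730):
    """Compare the two pricing modes for one month.

    Prices are illustrative placeholders. Returns
    (pay_per_token_cost, provisioned_cost, cheaper_mode).
    """
    on_demand = tokens_per_month * per_token_price
    provisioned = provisioned_hourly_price * hours
    mode = "pay-per-token" if on_demand <= provisioned else "provisioned"
    return on_demand, provisioned, mode


# Light traffic favors on-demand; heavy traffic favors provisioned throughput.
low = monthly_cost(5_000_000, 0.000002, 50.0)
high = monthly_cost(50_000_000_000, 0.000002, 50.0)
```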

Ready to get started?