Model Serving
Simplified Production ML


Introduction
Model Serving deploys any model (including large language models) as a REST API, allowing you to build real-time AI applications such as personalized recommendations, customer service chatbots and fraud detection. With Model Serving, you can add the power of generative AI to your apps without the hassle of managing infrastructure. Model Serving is built into the Databricks Lakehouse Platform, offering native integration with MLflow and online data stores, automated scaling and built-in monitoring of deployed models.
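For instance, once a model is deployed, any application can query it over REST. Here is a minimal Python sketch, where the workspace URL, endpoint name and input schema are hypothetical placeholders and a personal access token is supplied via the DATABRICKS_TOKEN environment variable:

import os
import requests

# Hypothetical values -- substitute your own workspace URL,
# endpoint name and model input schema.
WORKSPACE_URL = "https://my-workspace.cloud.databricks.com"
ENDPOINT_NAME = "fraud-detection"

response = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    # One record per row, keyed by the model's feature names.
    json={"dataframe_records": [{"amount": 129.99, "country": "US"}]},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. {"predictions": [0]}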

Simplified deployment for all AI models
Deploy any model type, from pretrained open source models to custom models built on your own data, on both CPU and GPU. Automated container builds and infrastructure management reduce maintenance costs and speed up deployment, so you can focus on building your AI projects and delivering value faster for your business.
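To illustrate how little infrastructure code this involves, here is a sketch of creating an endpoint with the Databricks Python SDK; the endpoint and registered-model names are hypothetical placeholders:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)

w = WorkspaceClient()  # picks up credentials from the environment

# Hypothetical endpoint and registered-model names.
w.serving_endpoints.create(
    name="fraud-detection",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="main.models.fraud_detector",
                entity_version="1",
                workload_size="Small",       # CPU; GPU workload types are also available
                scale_to_zero_enabled=True,  # scale down to zero when idle
            )
        ]
    ),
)

Container build, rollout and monitoring happen behind this single call; there is no serving image or cluster to manage yourself.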

Unified with Lakehouse data
Accelerate deployments and reduce errors through deep integration with the Lakehouse, which provides automated feature lookups, monitoring and governance across the entire AI lifecycle. With Unity Catalog integration, you get automatic governance and lineage across all your data, features and models.
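A brief sketch of that flow with MLflow and scikit-learn, using a hypothetical catalog.schema.model name; pointing the MLflow registry at Unity Catalog is what enables the automatic governance and lineage:

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Register models into Unity Catalog rather than the workspace registry.
mlflow.set_registry_uri("databricks-uc")

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        # Hypothetical three-level name: catalog.schema.model
        registered_model_name="main.models.fraud_detector",
        input_example=X[:1],  # captures a signature for lineage and validation
    )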

Real-time
Serve models as low-latency APIs on a highly available serverless service with both CPU and GPU support. Effortlessly scale from zero to meet your most critical needs, and back down as requirements change, paying only for the compute you use.
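For example, the same endpoint can be resized as requirements change. A sketch with the Databricks Python SDK, again using hypothetical names, that moves a new model version onto a larger workload while keeping scale-to-zero for idle periods:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ServedEntityInput

w = WorkspaceClient()

# Hypothetical: promote version 2 onto a larger CPU workload for
# peak traffic, still scaling to zero when idle.
w.serving_endpoints.update_config(
    name="fraud-detection",
    served_entities=[
        ServedEntityInput(
            entity_name="main.models.fraud_detector",
            entity_version="2",
            workload_size="Medium",
            scale_to_zero_enabled=True,
        )
    ],
)
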
Optimized for LLMs
Reduce cost and latency with optimizations specific to large language models (LLMs), tailored for select generative AI architectures. Benefit from these optimizations on popular open source models, such as Llama 2 and MPT, as well as models fine-tuned with your proprietary data.
Benchmarked on Llama-2-13B using an A10 GPU instance, with an output/input token ratio of 1024/128 and no quantization.
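As an illustration, an LLM-optimized endpoint is queried like any other. A sketch using the MLflow deployments client, with a hypothetical endpoint name and an assumed prompt-style input schema (the exact schema depends on how the model was logged and served):

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

# Hypothetical endpoint serving a fine-tuned Llama 2 model.
response = client.predict(
    endpoint="llama2-13b-chat",
    inputs={
        "prompt": "Summarize our refund policy in one sentence.",
        "max_tokens": 128,
        "temperature": 0.2,
    },
)
print(response)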