Model Serving

Simplified Production ML


Introduction

Model Serving deploys any model (including large language models) as a REST API, allowing you to build real-time AI applications like personalized recommendations, customer service chatbots and fraud detection. With Model Serving, you can add the power of generative AI to your apps, all without the hassle of managing the infrastructure. Model Serving is built into the Databricks Lakehouse Platform, offering native integration with MLflow and online data stores, automated scaling and built-in monitoring of deployed models.
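For a concrete picture, here is a minimal sketch of calling a served model over REST. The workspace URL, token and endpoint name are placeholders, and the input features are hypothetical:

```python
import requests

# Placeholders: substitute your own workspace URL, endpoint name and token.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
ENDPOINT_NAME = "my-model"  # hypothetical endpoint name
TOKEN = "<personal-access-token>"

# A served model sits behind a standard REST invocations route.
response = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"dataframe_records": [{"feature_a": 1.0, "feature_b": 2.0}]},
)
response.raise_for_status()
print(response.json())  # model predictions
```

Any HTTP client works; the calling application never touches clusters or containers directly.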

Simplified deployment for all AI models

Deploy any model type, from pretrained open source models to custom models built on your own data — on both CPU and GPU. Automated container build and infrastructure management reduce maintenance costs and speed up deployment so you can focus on building your AI projects and delivering value faster for your business.
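As an illustrative sketch only, creating an endpoint is a single declarative call; the endpoint and model names below are hypothetical, and the example assumes the Databricks SDK for Python with workspace credentials in the environment:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedModelInput,
)

w = WorkspaceClient()  # picks up workspace credentials from the environment

# Hypothetical registered model and version; sizing and scale-to-zero are
# declared here, while container build and infrastructure management
# happen behind the scenes.
w.serving_endpoints.create(
    name="my-model",
    config=EndpointCoreConfigInput(
        served_models=[
            ServedModelInput(
                model_name="my_catalog.my_schema.my_model",
                model_version="1",
                workload_size="Small",
                scale_to_zero_enabled=True,
            )
        ]
    ),
)
```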

Unified with Lakehouse data

Accelerate deployments and reduce errors through deep integration with the Lakehouse, which offers automated feature lookups, monitoring and governance across the entire AI lifecycle. With Unity Catalog integration, you get automatic governance and lineage across all your data, features and models.
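As a hedged sketch of that flow (the catalog, schema and model names are hypothetical, and an MLflow connection to a Databricks workspace is assumed), registering a model under Unity Catalog is a one-line change to the registry URI:

```python
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Point the MLflow registry at Unity Catalog so the registered model
# inherits UC governance and lineage.
mlflow.set_registry_uri("databricks-uc")

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        "model",
        # Hypothetical three-level Unity Catalog name: catalog.schema.model
        registered_model_name="my_catalog.my_schema.iris_classifier",
        input_example=X[:2],  # provides the signature UC registration expects
    )
```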

Real-time

Serve models as a low-latency API on a highly available serverless service with both CPU and GPU support. Effortlessly scale from zero to meet your most critical needs, and back down as requirements change, paying only for the compute you use.
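A hedged sketch of what that looks like in practice, reusing the hypothetical endpoint and model names from above with the Databricks SDK for Python:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ServedModelInput

w = WorkspaceClient()

# Scale-to-zero lets an idle endpoint release its compute entirely;
# a larger workload_size raises the provisioned concurrency ceiling.
w.serving_endpoints.update_config(
    name="my-model",
    served_models=[
        ServedModelInput(
            model_name="my_catalog.my_schema.my_model",
            model_version="1",
            workload_size="Medium",      # scale up for peak traffic
            scale_to_zero_enabled=True,  # scale to zero when idle
        )
    ],
)
```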

Optimized for LLMs

Reduce cost and latency through optimizations specific to large language models (LLMs), tailored for select generative AI architectures. Benefit from these optimizations on popular open source models, such as Llama 2 and MPT, as well as on models fine-tuned with your proprietary data.

Benchmarked on Llama-2-13B using an A10 GPU instance, with an output/input token ratio of 1024/128 and without quantization
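For illustration, a hedged sketch of querying an LLM endpoint, assuming a hypothetical endpoint name and an OpenAI-compatible chat schema on the invocations route:

```python
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
ENDPOINT_NAME = "llama-2-13b-chat"  # hypothetical LLM endpoint name
TOKEN = "<personal-access-token>"

# A chat-style payload; max_tokens and temperature bound the cost and
# variability of each request.
response = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "messages": [
            {"role": "user",
             "content": "Summarize our return policy in one sentence."}
        ],
        "max_tokens": 128,
        "temperature": 0.1,
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```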

Get started with these resources

eBook

Ready to get started?