Session
High-Throughput, Low-Latency: The Databricks Playbook for Production Model Serving
Overview
| Experience | In Person |
|---|---|
| Track | Artificial Intelligence & Agents |
| Industry | Enterprise Technology |
| Technologies | Agent Bricks |
| Skill Level | Intermediate |
Modern AI applications don’t just need accurate models; they need to serve them at massive scale, with low latency and strict reliability guarantees. In this session, we’ll go under the hood of Databricks Model Serving to show how we support high-QPS (300K+), production-grade workloads on a unified Lakehouse platform. We’ll walk through the serving architecture, autoscaling strategies, and optimizations across GPU/CPU utilization, request routing, and caching that enable sustained high throughput without sacrificing latency or cost. You’ll see real patterns from customer deployments, learn how to design your endpoints for bursty and always-on traffic, and leave with practical guidance for running mission-critical ML and LLM workloads on Databricks Model Serving.
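As a taste of the endpoint-design topic above: a minimal sketch of a Model Serving endpoint configuration, using field names from the public Databricks serving-endpoints API. The endpoint and model names here are hypothetical; for bursty traffic you might enable scale-to-zero as shown, while an always-on, latency-sensitive workload would typically disable it and provision a larger workload size.

```json
{
  "name": "fraud-scorer",
  "config": {
    "served_entities": [
      {
        "entity_name": "main.models.fraud_model",
        "entity_version": "3",
        "workload_size": "Small",
        "scale_to_zero_enabled": true
      }
    ]
  }
}
```

The trade-off: scale-to-zero saves cost during idle periods but adds cold-start latency on the first request after scale-down, which is why the session distinguishes bursty from always-on traffic patterns.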
Session Speakers
Anshul Gupta
Sr. Staff Software Engineer
Databricks
Mike Del Balso
Databricks