Mission-Critical Inference: Powering High-Scale AI in Production

Overview
| Experience | In Person |
|---|---|
| Track | Artificial Intelligence & Agents |
| Industry | Enterprise Technology |
| Technologies | Databricks Agents |
| Skill Level | Advanced |

The explosive growth in token consumption has made deploying LLMs at scale increasingly complex. Teams must balance latency, cost, and reliability. Databricks Model Serving is designed to absorb these operational challenges across the full model spectrum: proprietary APIs, open-source foundation models, fine-tuned variants, and fully custom architectures.
In this talk, we'll go under the hood of how we built token- and cache-aware routing, intelligent scheduling, and autoscaling to deliver efficient, resilient inference at scale. We'll then show how we extended this infrastructure with high-throughput batch support through the ai_query function in Databricks SQL, bringing LLM capabilities natively into data pipelines. Finally, we'll explore how we enable customers to serve their own post-trained or fine-tuned models and custom transformer-based architectures.
Whether you're optimizing costs, scaling to millions of requests, or bringing a custom architecture to production, this session offers a clear picture of what production-grade LLM serving looks like end to end.
Session Speakers
Brian Law, Sr. Specialist Solutions Architect, Databricks
Mike Eastham, Senior Staff Software Engineer, Databricks