Mission-Critical Inference: Powering High-Scale AI in Production

Overview

Experience: In Person
Track: Artificial Intelligence & Agents
Industry: Enterprise Technology
Technologies: Databricks Agents
Skill Level: Advanced

The explosive growth in token consumption has made deploying LLMs at scale increasingly complex. Teams must balance latency, cost, and reliability. Databricks Model Serving is designed to absorb these operational challenges across the full model spectrum: proprietary APIs, open-source foundation models, fine-tuned variants, and fully custom architectures.


In this talk, we'll go under the hood of how we built token- and cache-aware routing, intelligent scheduling, and autoscaling to deliver efficient, resilient inference at scale. We'll then show how we extended this infrastructure with high-throughput batch support through the ai_query function in Databricks SQL, bringing LLM capabilities natively into data pipelines. Finally, we'll explore how we enable customers to serve their own post-trained or fine-tuned models and custom transformer-based architectures.
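
As a rough illustration of what token- and cache-aware routing can mean in general, the sketch below picks the replica with the lowest projected token load and discounts requests whose prompt prefix a replica has likely already cached. It is not the Databricks implementation: `Replica`, `prefix_key`, `estimate_tokens`, `route`, and the `cache_discount` value are all hypothetical names and assumptions chosen for this example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical view of one model replica as seen by the router."""
    name: str
    pending_tokens: int = 0                       # tokens queued or in flight
    cached_prefixes: set = field(default_factory=set)


def prefix_key(prompt: str, prefix_chars: int = 64) -> str:
    """Hash the leading characters of a prompt as a stand-in for a cache key."""
    return hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).hexdigest()


def estimate_tokens(prompt: str) -> int:
    """Crude whitespace estimate; a real router would use the model tokenizer."""
    return max(1, len(prompt.split()))


def route(replicas, prompt, cache_discount=0.5):
    """Return the replica with the lowest projected token load.

    A replica that has already seen this prompt prefix is charged only a
    fraction of the request's tokens (cache_discount is an assumed value).
    """
    key = prefix_key(prompt)
    tokens = estimate_tokens(prompt)

    def projected_load(replica):
        cached = key in replica.cached_prefixes
        cost = tokens * cache_discount if cached else tokens
        return replica.pending_tokens + cost

    chosen = min(replicas, key=projected_load)
    chosen.pending_tokens += tokens       # token-aware: track queued work
    chosen.cached_prefixes.add(key)       # cache-aware: remember the prefix
    return chosen
```

With two replicas and a run of requests that share a long system prompt, this router keeps sending traffic to the replica holding the cached prefix until its token backlog outweighs the cache benefit, then spills over to the other replica.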


Whether you're optimizing costs, scaling to millions of requests, or bringing a custom architecture to production, this session offers a clear picture of what production-grade LLM serving looks like end to end.

Session Speakers

Brian Law

Sr. Specialist Solutions Architect
Databricks

Mike Eastham

Senior Staff Software Engineer
Databricks