Session

Running LLaMA at Scale: Production Inference on Databricks Model Serving

Overview

Experience: In Person
Track: Artificial Intelligence & Agents
Industry: Enterprise Technology, Communications - Media & Entertainment
Technologies: AI/BI
Skill Level: Intermediate

Serving LLMs at consumer scale stresses everything at once: throughput, latency, correctness, and cost. At Superhuman (formerly Grammarly), we built a production system on Databricks Model Serving that serves a fine-tuned LLaMA-3B model for grammatical error correction to millions of monthly users. This talk shares what we validated and how: direct-ingress patterns for high RPS, load testing toward 100K+ QPS targets, multi-region failover and cold-start mitigation, and observability. We'll cover validation checks (A/B tests, golden sets) plus the cost model we used to compare against our internal vLLM-based stack, including levers such as dynamic batching, autoscaling, and speculative decoding. Expect concrete configs, dashboards, and a candid readout of the wins, incidents, and unit-economics tradeoffs that informed our ramp from pilot to majority traffic.
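For a flavor of the measurement behind numbers like these, below is a minimal async load-test sketch against a Databricks Model Serving endpoint, with a back-of-envelope cost-per-million-requests readout. This is an illustrative sketch, not the talk's actual harness: the endpoint name (gec-llama-3b), payload shape, concurrency, and hourly cost figure are hypothetical placeholders; only the /serving-endpoints/<name>/invocations REST path follows the standard Model Serving invocation pattern.

```python
# Illustrative sketch only: a minimal async load-test harness for a
# Databricks Model Serving endpoint, plus back-of-envelope unit economics.
# Endpoint name, payload shape, concurrency, and cost figure are
# hypothetical placeholders, not the configuration from the talk.
import asyncio
import os
import time

import aiohttp

WORKSPACE_URL = os.environ.get("DATABRICKS_HOST", "https://example.cloud.databricks.com")
TOKEN = os.environ.get("DATABRICKS_TOKEN", "")
ENDPOINT = "gec-llama-3b"      # hypothetical endpoint name
CONCURRENCY = 64               # max in-flight requests
TOTAL_REQUESTS = 2_000
HOURLY_COST_USD = 10.0         # assumed all-in $/hour for the endpoint

# A GEC-style probe; the real payload shape depends on the model signature.
PAYLOAD = {"inputs": ["She go to school yesterday ."]}


async def one_request(session, sem, latencies, errors):
    """Send a single invocation, recording latency and any error."""
    async with sem:
        t0 = time.perf_counter()
        try:
            async with session.post(
                f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT}/invocations",
                json=PAYLOAD,
                headers={"Authorization": f"Bearer {TOKEN}"},
            ) as resp:
                await resp.read()
                if resp.status != 200:
                    errors.append(resp.status)
        except aiohttp.ClientError as exc:
            errors.append(type(exc).__name__)
        finally:
            latencies.append(time.perf_counter() - t0)


async def main():
    sem = asyncio.Semaphore(CONCURRENCY)
    latencies, errors = [], []
    start = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *(one_request(session, sem, latencies, errors)
              for _ in range(TOTAL_REQUESTS))
        )
    wall = time.perf_counter() - start

    qps = TOTAL_REQUESTS / wall
    ordered = sorted(latencies)
    p50 = ordered[len(ordered) // 2]
    p99 = ordered[int(len(ordered) * 0.99)]
    # Unit economics: cost per 1M requests at the sustained QPS of this run.
    cost_per_million = HOURLY_COST_USD / (qps * 3600) * 1_000_000

    print(f"QPS={qps:,.0f}  p50={p50 * 1000:.0f} ms  p99={p99 * 1000:.0f} ms  "
          f"errors={len(errors)}  ~${cost_per_million:,.2f} per 1M requests")


if __name__ == "__main__":
    asyncio.run(main())
```

A real campaign toward 100K+ QPS would distribute workers across many machines and regions; a single-process harness like this mostly serves to sanity-check latency percentiles and the cost arithmetic before scaling out.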

Session Speakers

Christoph Stuber

Machine Learning Engineer
Superhuman

Mykhailo Troianovskyi

Machine Learning Software Engineer
Superhuman