Session

Running LLaMA at Scale: Production Inference on Databricks Model Serving

Overview

Experience: In Person
Track: Artificial Intelligence & Agents
Industry: Enterprise Technology, Communications - Media & Entertainment
Technologies: AI/BI
Skill Level: Intermediate

Serving LLMs at consumer scale stresses everything at once: throughput, latency, correctness, and cost. At Superhuman (formerly Grammarly), we built a production system on Databricks Model Serving that serves a fine-tuned LLaMA-3B model for grammatical error correction to millions of monthly users. This talk shares what we validated and how: direct-ingress patterns for high RPS, load testing toward 100K+ QPS targets, multi-region failover and cold-start mitigation, and observability. We'll cover validation checks (A/B tests, golden sets) plus the cost model we used to compare against our internal vLLM-based stack, including levers such as dynamic batching, autoscaling, and speculative decoding. Expect concrete configs, dashboards, and a candid readout of the wins, incidents, and unit-economics tradeoffs that informed our ramp from pilot to majority traffic.
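For a flavor of the measurement behind numbers like these, below is a minimal async load-test sketch against a Databricks Model Serving endpoint, with a back-of-envelope cost-per-million-requests readout. This is an illustrative sketch, not the talk's actual harness: the endpoint name (gec-llama-3b), payload shape, concurrency, and hourly cost figure are hypothetical placeholders; only the /serving-endpoints/<name>/invocations REST path follows the standard Model Serving invocation pattern.

```python
# Illustrative sketch only: a minimal async load-test harness for a
# Databricks Model Serving endpoint, plus back-of-envelope unit economics.
# Endpoint name, payload shape, concurrency, and cost figure are
# hypothetical placeholders, not the configuration from the talk.
import asyncio
import os
import time

import aiohttp

WORKSPACE_URL = os.environ.get("DATABRICKS_HOST", "https://example.cloud.databricks.com")
TOKEN = os.environ.get("DATABRICKS_TOKEN", "")
ENDPOINT = "gec-llama-3b"      # hypothetical endpoint name
CONCURRENCY = 64               # max in-flight requests
TOTAL_REQUESTS = 2_000
HOURLY_COST_USD = 10.0         # assumed all-in $/hour for the endpoint

# A GEC-style probe; the real payload shape depends on the model signature.
PAYLOAD = {"inputs": ["She go to school yesterday ."]}


async def one_request(session, sem, latencies, errors):
    """Send a single invocation, recording latency and any error."""
    async with sem:
        t0 = time.perf_counter()
        try:
            async with session.post(
                f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT}/invocations",
                json=PAYLOAD,
                headers={"Authorization": f"Bearer {TOKEN}"},
            ) as resp:
                await resp.read()
                if resp.status != 200:
                    errors.append(resp.status)
        except aiohttp.ClientError as exc:
            errors.append(type(exc).__name__)
        finally:
            latencies.append(time.perf_counter() - t0)


async def main():
    sem = asyncio.Semaphore(CONCURRENCY)
    latencies, errors = [], []
    start = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *(one_request(session, sem, latencies, errors)
              for _ in range(TOTAL_REQUESTS))
        )
    wall = time.perf_counter() - start

    qps = TOTAL_REQUESTS / wall
    ordered = sorted(latencies)
    p50 = ordered[len(ordered) // 2]
    p99 = ordered[int(len(ordered) * 0.99)]
    # Unit economics: cost per 1M requests at the sustained QPS of this run.
    cost_per_million = HOURLY_COST_USD / (qps * 3600) * 1_000_000

    print(f"QPS={qps:,.0f}  p50={p50 * 1000:.0f} ms  p99={p99 * 1000:.0f} ms  "
          f"errors={len(errors)}  ~${cost_per_million:,.2f} per 1M requests")


if __name__ == "__main__":
    asyncio.run(main())
```

A real campaign toward 100K+ QPS would distribute workers across many machines and regions; a single-process harness like this mostly serves to sanity-check latency percentiles and the cost arithmetic before scaling out.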

Session Speakers

Christoph Stuber

Machine Learning Engineer
Superhuman

Mykhailo Troianovskyi

Machine Learning Software Engineer
Superhuman