Session
Scaling Custom LLMs with vLLM and Databricks Model Serving: Fast, Flexible, and Production-Ready
Overview
| | |
|---|---|
| Experience | In Person |
| Track | Artificial Intelligence & Agents |
| Industry | Consulting & Services |
| Technologies | Unity Catalog |
| Skill Level | Advanced |
Databricks Model Serving supports deployments ranging from classical ML models on custom CPU workloads to foundation models on dedicated provisioned throughput endpoints. But what about the use cases that need any of the thousands of other open-source or fine-tuned LLMs? Deploying them efficiently can be challenging.

This breakout session explores deploying LLMs on custom GPU endpoints with vLLM. In it, we'll examine:

- Serverless GPU compute: how it simplifies the deployment process, saving hours of configuration work alone.
- vLLM + GPU workloads: how they work in tandem on the platform to deliver high-throughput inference on scalable infrastructure.
- Implementation: a step-by-step code walkthrough of packaging models, configuring the vLLM runtime, and deploying the endpoint (a sketch of the packaging step follows below).

Serving LLMs on GPUs doesn't have to be scary. Learn how Databricks enables you to deploy even your most demanding model serving workloads!
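To give a flavor of the packaging step, here is a minimal, hypothetical sketch of wrapping a vLLM-backed model as an MLflow pyfunc so it can be logged, registered, and served from a custom GPU endpoint. The model name, sampling parameters, and pip requirements are placeholder assumptions, not the session's actual code, which may differ.

```python
# Hypothetical sketch: package a vLLM-backed LLM as an MLflow pyfunc model.
# Model name, sampling parameters, and requirements are placeholders.
import mlflow
import pandas as pd


class VLLMWrapper(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # Import vLLM lazily so the wrapper can be logged on a machine without a GPU.
        from vllm import LLM, SamplingParams

        # Placeholder open-source checkpoint; swap in any vLLM-compatible model.
        self.llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
        self.sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

    def predict(self, context, model_input: pd.DataFrame) -> list[str]:
        # Expect a DataFrame with a "prompt" column; return one completion per row.
        prompts = model_input["prompt"].tolist()
        outputs = self.llm.generate(prompts, self.sampling_params)
        return [o.outputs[0].text for o in outputs]


# Log the wrapper so it can be registered (e.g., in Unity Catalog) and then
# deployed to a GPU model serving endpoint.
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="vllm_model",
        python_model=VLLMWrapper(),
        pip_requirements=["vllm", "pandas"],
    )
```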
Session Speakers
Colton Peltier
Senior Staff AI FDE
Databricks
Mohamad Aboufoul
Senior AI FDE
Databricks