Session
Scaling Custom LLMs with vLLM and Databricks Model Serving: Fast, Flexible, and Production-Ready
Overview
| Experience | In Person |
|---|---|
| Track | Artificial Intelligence & Agents |
| Industry | Consulting & Services |
| Technologies | Databricks Agents |
| Skill Level | Advanced |
Databricks Model Serving supports deployments ranging from classical ML models on custom-CPU workloads to foundation models on dedicated provisioned throughput endpoints. But what about use cases that need one of the thousands of other open-source or fine-tuned LLMs? Deploying them efficiently can be challenging.
This breakout session explores deploying LLMs on custom-GPU endpoints with vLLM. In it, we’ll examine:
- Serverless GPU Compute: How it simplifies the deployment process, saving hours on configuration work alone.
- vLLM + GPU workloads: How they work in tandem on the platform to deliver high-throughput inference with scalable infrastructure.
- Implementation: A step-by-step code walkthrough for packaging models, configuring the vLLM runtime, and deploying the endpoint.
Serving LLMs on GPUs doesn’t have to be scary. Learn more about how Databricks enables you to deploy even your most demanding model serving workloads!
Session Speakers
Colton Peltier
Senior Staff AI FDE
Databricks
Mohamad Aboufoul
Senior AI FDE
Databricks