Session

Scaling Custom LLMs with vLLM and Databricks Model Serving: Fast, Flexible, and Production-Ready

Overview

Experience: In Person
Track: Artificial Intelligence & Agents
Industry: Consulting & Services
Technologies: Unity Catalog
Skill Level: Advanced
Databricks Model Serving supports deployments ranging from classical ML models on custom CPU endpoints to foundation models on dedicated provisioned throughput endpoints. But what about the use cases that need any of the thousands of other open-source or fine-tuned LLMs? Deploying them efficiently can be challenging.

This breakout session explores deploying LLMs on custom GPU endpoints with vLLM. In it, we'll examine:

- Serverless GPU compute: how it simplifies the deployment process, saving hours of configuration work alone.
- vLLM + GPU workloads: how they work in tandem on the platform to deliver high-throughput inference on scalable infrastructure.
- Implementation: a step-by-step code walkthrough for packaging models, configuring the vLLM runtime, and deploying!

Serving LLMs on GPUs doesn't have to be scary - learn how Databricks enables you to deploy even your most demanding model serving workloads!
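To give a flavor of the kind of workflow the walkthrough covers, a minimal vLLM serving setup can be sketched as follows. This is an illustrative sketch only: the model name, port, and flag values are assumptions for demonstration, not the session's exact configuration or the Databricks endpoint setup.

```shell
# Install vLLM (assumes a CUDA-capable GPU environment)
pip install vllm

# Launch an OpenAI-compatible inference server for an example
# open-source model (model and flags are illustrative assumptions)
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90

# Query the endpoint via the OpenAI-compatible chat completions API
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3",
         "messages": [{"role": "user", "content": "Hello!"}]}'
```

On Databricks, the session's approach wraps this runtime behind a custom GPU model serving endpoint rather than a hand-managed server, which is where the serverless GPU compute discussed above comes in.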

Session Speakers


Colton Peltier

Senior Staff AI FDE
Databricks

Mohamad Aboufoul

Senior AI FDE
Databricks