SESSION

Accelerating LLM Inference with vLLM


OVERVIEW

EXPERIENCE: In Person
TYPE: Breakout
TRACK: Data Science and Machine Learning
INDUSTRY: Enterprise Technology
TECHNOLOGIES: AI/Machine Learning, GenAI/LLMs
SKILL LEVEL: Intermediate
DURATION: 40 min

vLLM is an open-source, high-performance engine for LLM inference and serving developed at UC Berkeley. It has been widely adopted across the industry, with 12K+ GitHub stars and 150+ contributors worldwide. Since its initial release, the vLLM team has improved performance by more than 10x. This session will cover key topics in LLM inference performance, including paged attention and continuous batching. We will then focus on new innovations we've made in vLLM and the technical challenges behind them, including speculative decoding, prefix caching, disaggregated prefill, and multi-accelerator support. The session will conclude with industry case studies of vLLM and future roadmap plans.

Takeaways:
  • vLLM is an open-source engine for LLM inference and serving, providing state-of-the-art performance and an accelerator-agnostic design (see the usage sketch after this list).
  • In focusing on production-readiness and extensibility, vLLM’s design choices have led to new system insights and rapid community adoption.
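
For context, here is a minimal sketch of offline batch inference with vLLM's Python API. The model name, prompts, and sampling settings are illustrative placeholders, not recommendations from the session:

    # Minimal offline-inference sketch using vLLM's Python API.
    # The model and prompts are placeholders; any model supported by vLLM works.
    from vllm import LLM, SamplingParams

    prompts = [
        "The future of LLM serving is",
        "Paged attention improves throughput because",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    # The engine applies paged attention and continuous batching internally,
    # so the batch of prompts is scheduled efficiently without manual tuning.
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(output.prompt, "->", output.outputs[0].text)

The same engine can also be launched as an online server for production serving; the features discussed in the session (speculative decoding, prefix caching, disaggregated prefill) build on this core API.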

SESSION SPEAKERS

Zhuohan Li

PhD Student
UC Berkeley / vLLM

Cade Daniel

Software Engineer
Anyscale