Haoran Qiu, Weichao Mao, et al.
ASPLOS 2024
Large Language Models (LLMs) are reshaping how we build applications; however, efficiently serving them at scale remains a major challenge.
The vLLM serving engine, historically focused on single-node deployments, is now being extended into a full-stack inference system through our open-source project, vLLM Production Stack. This extension enables any organization to deploy vLLM at scale with high reliability, high throughput, and low latency. Code: https://github.com/vllm-project/production-stack
At a high level, the vLLM Production Stack project allows users to deploy the full stack to their Kubernetes cluster with a single command. vLLM Production Stack's optimizations include KV cache sharing to speed up inference (https://github.com/LMCache/LMCache), prefix-aware routing that directs inference queries to the vLLM instances already holding the corresponding KV caches, and robust observability features for monitoring engine status and autoscaling.
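To make the prefix-aware routing idea concrete, here is a minimal Python sketch, assuming each vLLM instance retains the KV cache of prompts it has already served. The PrefixAwareRouter class, the endpoint URLs, and the prefix_len parameter are illustrative assumptions, not the Production Stack's actual router API: requests whose prompts share a leading prefix are hashed to the same instance, which is the one most likely to already hold the matching KV cache.

```python
import hashlib


class PrefixAwareRouter:
    """Illustrative sketch only; hypothetical names, not the real router API."""

    def __init__(self, endpoints, prefix_len=256):
        self.endpoints = endpoints    # assumed vLLM instance URLs
        self.prefix_len = prefix_len  # number of leading characters used as the routing key

    def route(self, prompt: str) -> str:
        # Requests that share a prefix (system prompt, few-shot examples,
        # chat history) hash to the same instance, which most likely already
        # holds the matching KV cache.
        key = hashlib.sha256(prompt[: self.prefix_len].encode("utf-8")).hexdigest()
        return self.endpoints[int(key, 16) % len(self.endpoints)]


# Usage: both requests share the same system prompt, so with a prefix_len
# shorter than that prompt they are routed to the same instance.
router = PrefixAwareRouter(["http://vllm-0:8000", "http://vllm-1:8000"], prefix_len=32)
system = "You are a helpful assistant. Answer concisely.\n"
assert router.route(system + "What is vLLM?") == router.route(system + "What is a KV cache?")
```

A production router would also need to track instance load and fall back to load-based routing when no instance holds a matching cached prefix.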
Attendees will discover best practices and see real-time demonstrations of how these optimizations work together to enhance LLM inference performance.
Julian James Stephen, Michael Le
OSSNA 2025
Jose Manuel Bernabé Murcia, Eduardo Canovas Martinez, et al.
MobiSec 2024
Sahil Suneja, Yufan Zhuang, et al.
ACM TOSEM