The figure above provides an overview of the first approach: Chiron.
Chiron follows a hierarchical design to meet TTFT and ITL SLOs while maximizing throughput, and it does so at two levels: a local autoscaler scales the batch size of each individual instance, and a global orchestrator scales the interactive, mixed, and batch instance pools and orders the requests routed to them.
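To make the local level concrete, here is a minimal sketch of a batch-size control loop that grows the batch while the measured ITL stays under its SLO and backs off otherwise. The loop shape, function name, and parameters are illustrative assumptions, not Chiron's exact policy.

```python
# Hedged sketch of a local batch-size autoscaler: grow the instance's maximum batch size
# while the measured inter-token latency (ITL) stays under its SLO, shrink it otherwise.
# The control loop and its parameters are illustrative, not Chiron's actual implementation.

def adjust_batch_size(current_batch_size: int,
                      measured_itl_ms: float,
                      itl_slo_ms: float,
                      min_batch: int = 1,
                      max_batch: int = 256,
                      headroom: float = 0.9) -> int:
    """Return the batch size to use for the next scheduling window."""
    if measured_itl_ms < headroom * itl_slo_ms:
        return min(current_batch_size * 2, max_batch)   # room to spare: admit more requests
    if measured_itl_ms > itl_slo_ms:
        return max(current_batch_size // 2, min_batch)  # ITL SLO violated: back off
    return current_batch_size                           # near the SLO: hold steady
```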
Each request is preferentially routed to its own instance type (interactive requests to interactive instances, batch requests to batch instances), leading to non-uniform routing of requests in Chiron. If capacity is unavailable on its own instance type, the request is routed to the mixed instances. Mixed instances enable multiplexing between interactive and batch requests and drive up overall cluster utilization. For interactive requests, the mixed instances absorb unpredictable spikes in request arrivals. For batch requests, the mixed instances provide additional running capacity when there are not enough interactive requests to fill them.
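A minimal sketch of this preferential routing with a mixed-pool fallback is shown below. The pool names, capacities, and the `has_capacity()` check are assumptions for illustration, not Chiron's actual interfaces.

```python
# Hypothetical sketch of Chiron-style preferential routing with a mixed-pool fallback.
from dataclasses import dataclass
from collections import deque

@dataclass
class InstancePool:
    name: str
    capacity: int          # max concurrent requests the pool can admit (illustrative)
    running: int = 0

    def has_capacity(self) -> bool:
        return self.running < self.capacity

    def admit(self, request) -> None:
        self.running += 1

class Router:
    def __init__(self):
        self.pools = {
            "interactive": InstancePool("interactive", capacity=8),
            "batch": InstancePool("batch", capacity=8),
            "mixed": InstancePool("mixed", capacity=4),
        }
        self.global_queue = deque()   # requests waiting for capacity

    def route(self, request) -> str:
        preferred = self.pools[request["type"]]       # "interactive" or "batch"
        if preferred.has_capacity():
            preferred.admit(request)
            return preferred.name
        mixed = self.pools["mixed"]
        if mixed.has_capacity():                      # fall back to mixed instances
            mixed.admit(request)
            return mixed.name
        self.global_queue.append(request)             # otherwise wait in the global queue
        return "queued"

router = Router()
print(router.route({"id": 1, "type": "interactive"}))   # -> "interactive"
```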
To enable this multiplexing between interactive and batch requests while ensuring the immediate execution of interactive requests, mixed instances are preemptible: interactive requests can evict batch requests and send them back into the global queue. To prevent a throughput drop from such an eviction, we enable fast restart: the evicted request's KV cache is preserved by migrating it to CPU memory.
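The sketch below illustrates this preemption path under stated assumptions: the request and instance structures, the eviction policy (least decoding progress), and the KV-cache offload call are all hypothetical stand-ins, not Chiron's or any serving engine's real API.

```python
# Illustrative sketch of evicting a batch request from a mixed instance so an
# interactive request runs immediately, with "fast restart" via KV-cache offload.
from dataclasses import dataclass, field
from collections import deque
from typing import Optional

@dataclass
class Request:
    rid: int
    kind: str                              # "interactive" or "batch"
    tokens_generated: int = 0
    kv_cache_cpu: Optional[bytes] = None   # KV cache saved to CPU memory on eviction

@dataclass
class MixedInstance:
    slots: int
    running: list = field(default_factory=list)

    def has_capacity(self) -> bool:
        return len(self.running) < self.slots

    def offload_kv_cache(self, req: Request) -> bytes:
        # Placeholder for a GPU -> CPU copy of the request's KV-cache blocks.
        return b"kv-cache-of-request-%d" % req.rid

def admit_interactive(inst: MixedInstance, req: Request, global_queue: deque) -> None:
    if not inst.has_capacity():
        batch_reqs = [r for r in inst.running if r.kind == "batch"]
        if not batch_reqs:
            global_queue.append(req)                 # nothing evictable; request waits
            return
        # Evict the batch request with the least decoding progress (assumed policy).
        victim = min(batch_reqs, key=lambda r: r.tokens_generated)
        victim.kv_cache_cpu = inst.offload_kv_cache(victim)   # fast restart: save KV cache
        inst.running.remove(victim)
        global_queue.appendleft(victim)              # send it back to the global queue
    inst.running.append(req)                         # interactive request starts immediately
```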
The global autoscaler is driven by an estimate of request waiting time in the global queue. As the queue grows larger, the statistical effects of continuous batching allow Chiron to derive a tighter bound on waiting time.
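A hedged sketch of such a waiting-time-based scaling check is below. The throughput model and the way the bound tightens with queue size (relative slack shrinking roughly as 1/sqrt of the queue length) are illustrative assumptions, not Chiron's exact estimator.

```python
# Illustrative waiting-time estimator and scaling check; not Chiron's actual formulas.
import math

def estimate_waiting_time(queue_len: int,
                          per_instance_tput: float,   # requests/sec under continuous batching
                          num_instances: int,
                          tput_cv: float = 0.3) -> tuple[float, float]:
    """Return (mean_wait_s, upper_bound_wait_s) for the last request in the queue."""
    service_rate = per_instance_tput * num_instances
    mean_wait = queue_len / service_rate
    # With continuous batching, per-request service times average out over a large queue,
    # so the relative slack on the bound shrinks as the queue grows (assumed model).
    slack = tput_cv / math.sqrt(max(queue_len, 1))
    return mean_wait, mean_wait * (1.0 + slack)

def instances_needed(queue_len: int, per_instance_tput: float, deadline_s: float) -> int:
    """Smallest instance count whose waiting-time upper bound meets the deadline."""
    n = 1
    while estimate_waiting_time(queue_len, per_instance_tput, n)[1] > deadline_s:
        n += 1
    return n
```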
The figure above provides an overview of the second approach: QLM, designed for fixed-capacity deployments. In addition to the routing and eviction mechanisms from Chiron, QLM uses model swapping to let multiple models share the same serving instance.
Every incoming request is grouped with other requests that share common performance characteristics (such as model type, SLO value, and token distribution) to form request groups. Request groups are a useful abstraction for applying waiting time estimation. Requests in a group are then assigned to a virtual queue, which represents the waiting queue of an LLM serving instance in the cluster. The ordering of the request groups within a virtual queue determines the execution order of the requests on the corresponding LLM serving instance. While requests are assigned to groups in a first-come-first-serve manner, the groups in a virtual queue are reordered by the global scheduler to maximize SLO attainment across all requests being served.
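The data structures involved can be sketched roughly as follows. The grouping key, the class names, and the earliest-deadline-first reordering used here are stand-ins to illustrate the abstraction, not QLM's actual scheduling policy.

```python
# Sketch of QLM-style request groups and virtual queues (names and policy are illustrative).
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    model: str
    slo_s: float            # end-to-end SLO for this request
    arrival_s: float

@dataclass
class RequestGroup:
    key: tuple               # shared performance characteristics, e.g. (model, slo_s)
    requests: list = field(default_factory=list)

    def deadline(self) -> float:
        # Earliest deadline among the group's requests.
        return min(r.arrival_s + r.slo_s for r in self.requests)

class VirtualQueue:
    """Waiting queue for one LLM serving instance; holds ordered request groups."""
    def __init__(self):
        self.groups: dict[tuple, RequestGroup] = {}

    def add(self, req: Request) -> None:
        key = (req.model, req.slo_s)                   # requests join their group FCFS
        self.groups.setdefault(key, RequestGroup(key)).requests.append(req)

    def ordered_groups(self) -> list[RequestGroup]:
        # The global scheduler reorders whole groups to maximize SLO attainment;
        # earliest-deadline-first is used here as an assumed stand-in policy.
        return sorted(self.groups.values(), key=lambda g: g.deadline())
```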
In the figure above, we show an example workflow for Chiron and compare it against Llumnix, a state-of-the-art LLM orchestration system. Initially, the workload comprises only interactive requests arriving according to a Gamma distribution with a mean of 30 requests per second and a CV of 4. Both Chiron and Llumnix are over-provisioned in this scenario, with an average of 15 GPUs. Note that we use a tuned version of Llumnix that has instance-level throughput similar to Chiron's. At 5 minutes in, the batch request queue is populated with 1 million requests. Llumnix does not queue these batch requests; it immediately starts adding instances to reduce GPU utilization until the maximum cluster capacity of 50 instances is reached. Chiron, on the other hand, keeps the batch requests in the queue and prefers to multiplex them onto the over-provisioned capacity of 10 GPUs (out of 15 GPUs).
Because batch requests have a relaxed ITL SLO, Chiron's local autoscaler is able to maintain a higher throughput of 20 requests per second on this over-provisioned capacity. After 50 minutes, Chiron's waiting time estimation calculates that roughly 200,000 requests still remain to be processed, and 10 new instances are added to finish the queue by the deadline. At 65 minutes, all requests are completed by Chiron. Because Llumnix does not adapt the batch size for the newly added instances, it continues to serve requests at a reduced throughput. Consequently, by the deadline of 65 minutes, only 50% of requests sent through Llumnix satisfy their SLOs. Overall, in this scenario, Chiron uses 60% fewer GPU node-hours while meeting all SLOs.
The benefits of QLM and Chiron from multiplexing, dynamic batch sizes, and model swapping translate into reduced serving costs, as shown in the figure below. The workload is sampled from the ShareGPT dataset with an equal split between batch and interactive requests.