Prefix and Output Length-Aware Scheduling for Efficient Online LLM Inference

Published: 05 Mar 2025, Last Modified: 10 Apr 2025 · SLLM · CC BY 4.0
Track: long paper (up to 4 pages)
Keywords: large language model, inference, scheduling, prefix, output length
TL;DR: We improve the performance of a distributed LLM serving platform by considering both prefix sharing and output length in scheduling.
Abstract:

To meet high user demand, production LLM inference systems use data parallelism to distribute the request pool evenly across multiple GPUs. However, in modern AI applications such as chatbots, code generation, search engines, and agents, prompt prefixes are often shared, so performance can improve when requests with shared prefixes are assigned to the same GPU and the intermediate KV cache is reused. Prior work such as Preble has developed distributed LLM serving platforms that optimize for prompt sharing; however, we hypothesized that additional performance gains could be achieved by integrating prefix-aware and output length-aware scheduling. To that end, we extend Preble's adaptive prefix-aware scheduler to account for output length, which can be estimated with a lightweight BERT model or another cheap predictor. To benchmark this modification, we also build on Preble's online LLM inference simulation to support overhead tracking, variable output lengths, experiment caching, and data analysis. This simulation platform allows us to demonstrate that including output length in the per-GPU load calculation improves Preble's performance, reducing latency by 14.31% and 28.89% at 64 and 128 requests per second, respectively, on 8 GPUs. Considering both output length and shared prefixes may therefore improve the efficiency of online LLM inference in high-demand settings.
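To make the scheduling idea concrete, below is a minimal Python sketch of prefix- and output length-aware dispatch. It is illustrative only and does not reproduce Preble's actual scheduler: the names (GPUState, longest_shared_prefix, schedule), the linear cost model, and the decode_weight knob are all assumptions for this sketch. In the paper's setting, predicted_output_len would come from a cheap predictor such as a small BERT model, and prefix matching would use a radix tree rather than a list scan.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class GPUState:
    """Per-GPU bookkeeping in token units (hypothetical fields, not Preble's internals)."""
    pending_prefill_tokens: int = 0   # uncached prompt tokens still waiting to be prefilled
    pending_decode_tokens: int = 0    # predicted output tokens still waiting to be decoded
    cached_prefixes: List[Tuple[int, ...]] = field(default_factory=list)  # toy stand-in for a radix tree

def longest_shared_prefix(prompt: Tuple[int, ...], cached: List[Tuple[int, ...]]) -> int:
    """Length of the longest cached prefix matching this prompt."""
    best = 0
    for pref in cached:
        n = 0
        for a, b in zip(prompt, pref):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best

def schedule(prompt: Tuple[int, ...], predicted_output_len: int,
             gpus: List[GPUState], decode_weight: float = 2.0) -> int:
    """Assign the request to the GPU with the lowest estimated load, counting both
    the uncached portion of the prompt (prefill work) and the predicted output
    length (decode work). decode_weight is an illustrative knob, not a value from the paper."""
    best_gpu, best_hit, best_cost = 0, 0, float("inf")
    for i, g in enumerate(gpus):
        hit = longest_shared_prefix(prompt, g.cached_prefixes)
        prefill_cost = g.pending_prefill_tokens + (len(prompt) - hit)
        decode_cost = decode_weight * (g.pending_decode_tokens + predicted_output_len)
        cost = prefill_cost + decode_cost
        if cost < best_cost:
            best_gpu, best_hit, best_cost = i, hit, cost
    # Commit the request to the chosen GPU's bookkeeping.
    g = gpus[best_gpu]
    g.pending_prefill_tokens += len(prompt) - best_hit
    g.pending_decode_tokens += predicted_output_len
    g.cached_prefixes.append(prompt)
    return best_gpu
```

The point the sketch captures is that the per-GPU load term includes both uncached prefill tokens and predicted decode tokens, so long-output requests get spread across replicas instead of piling onto the one with the best prefix hit.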

Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 72