To meet high user demand, production LLM inference systems use data parallelism to distribute the request pool evenly across multiple GPUs. However, in modern AI applications such as chatbots, code generation, search engines, and agents, prompt prefixes are often shared, so performance can improve if requests with shared prefixes are assigned to the same GPU and the intermediate KV cache is reused. Prior work such as Preble has developed distributed LLM serving platforms that optimize for prompt sharing; we hypothesized that additional performance gains could be achieved by integrating prefix-aware and output-length-aware scheduling. To that end, we extend Preble's adaptive prefix-aware scheduler to account for output length, which can be estimated with a lightweight BERT model or another cheap predictor. To benchmark this modification, we also build on Preble's online LLM inference simulation to support overhead tracking, variable output lengths, experiment caching, and data analysis. This simulation platform allows us to demonstrate that including output length in the per-GPU load calculation improves Preble's performance, reducing latency by 14.31% and 28.89% at 64 and 128 requests per second, respectively, on 8 GPUs. Thus, considering both output length and shared prefixes may improve the efficiency of online LLM inference in high-demand settings.
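To make the scheduling idea concrete, the following is a minimal sketch (not the authors' implementation) of a per-GPU load score that combines prefix reuse with a predicted output length. The class and function names (`GPUState`, `load_score`, `assign_request`), the decode-token weight of 2.0, and the use of a prefix ID to track cached KV prefixes are all illustrative assumptions.

```python
# Hedged sketch of prefix-aware, output-length-aware request assignment.
# All names and weights below are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class GPUState:
    pending_prefill_tokens: int = 0   # prompt tokens still to be prefilled
    pending_decode_tokens: int = 0    # predicted output tokens still to be decoded
    cached_prefixes: set = field(default_factory=set)  # prefix IDs with resident KV cache

def load_score(gpu: GPUState) -> float:
    # Assumed weighting: decode tokens count double because each decoded
    # token requires a separate forward pass.
    return gpu.pending_prefill_tokens + 2.0 * gpu.pending_decode_tokens

def assign_request(gpus, prompt_tokens, prefix_id, predicted_output_len):
    """Pick the GPU with the lowest effective load, crediting prefix reuse."""
    best_gpu, best_cost = None, float("inf")
    for gpu in gpus:
        # If the shared prefix's KV cache is already resident, prefill cost shrinks.
        prefill = 0 if prefix_id in gpu.cached_prefixes else prompt_tokens
        cost = load_score(gpu) + prefill + 2.0 * predicted_output_len
        if cost < best_cost:
            best_gpu, best_cost = gpu, cost
    # Book-keep the chosen GPU's load with the new request.
    if prefix_id not in best_gpu.cached_prefixes:
        best_gpu.pending_prefill_tokens += prompt_tokens
    best_gpu.pending_decode_tokens += predicted_output_len
    best_gpu.cached_prefixes.add(prefix_id)
    return best_gpu
```

In this sketch, `predicted_output_len` would come from the lightweight output-length predictor (e.g., a small BERT model), so that a GPU holding many long-generation requests is treated as more loaded than one holding the same number of short-generation requests.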