Keywords: LLM prefix caching, LLM serving, Distributed systems for ML
TL;DR: This paper proposes Preble, the first distributed LLM serving platform that targets prompt sharing and outperforms SOTA serving systems by up to 14.5× on average latency.
Abstract: Prompts to large language models (LLMs) have evolved beyond simple user questions. For LLMs to solve complex problems, today's practice is to include domain-specific instructions, illustrations of tool usage, and/or long context such as textbook chapters in prompts. As such, many parts of prompts are repetitive across requests. Recent works propose to cache and reuse the KV state of prompts. However, they are all confined to single-GPU optimization, while production LLM serving systems are distributed by nature. This paper proposes Preble, the first distributed LLM serving platform that targets and optimizes for prompt sharing. We design a distributed scheduling system that co-optimizes KV state reuse and computation load balancing with a new scheduling algorithm and a hierarchical scheduling mechanism. Our evaluation of Preble on real workloads and request arrival patterns with two open-source LLMs shows that Preble outperforms SOTA serving systems by 1.5× to 14.5× on average latency and 2× to 10× on p99 latency.
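To make the co-optimization idea in the abstract concrete, the sketch below shows one simple way a distributed scheduler could trade off KV-cache (prefix) reuse against per-GPU compute load. This is a minimal illustration under assumed details, not Preble's actual scheduling algorithm: the GPUState structure, the token-count cost model, and the load_weight knob are all assumptions made for this example.

```python
# Illustrative sketch only: route each request to the GPU that minimizes
# (tokens to recompute after prefix-cache hits) + load_weight * (queued tokens).
from dataclasses import dataclass, field


@dataclass
class GPUState:
    # Token prefixes currently resident in this GPU's KV cache (assumed bookkeeping).
    cached_prefixes: list[tuple[str, ...]] = field(default_factory=list)
    # Rough proxy for current compute load, e.g. tokens queued for prefill/decode.
    pending_tokens: int = 0


def longest_cached_prefix(gpu: GPUState, tokens: tuple[str, ...]) -> int:
    """Length of the longest cached prefix matching the request's tokens."""
    best = 0
    for prefix in gpu.cached_prefixes:
        match = 0
        for a, b in zip(prefix, tokens):
            if a != b:
                break
            match += 1
        best = max(best, match)
    return best


def schedule(request_tokens: tuple[str, ...],
             gpus: dict[str, GPUState],
             load_weight: float = 1.0) -> str:
    """Pick the GPU with the lowest combined recompute + load cost."""
    def cost(gpu: GPUState) -> float:
        recompute = len(request_tokens) - longest_cached_prefix(gpu, request_tokens)
        return recompute + load_weight * gpu.pending_tokens

    return min(gpus, key=lambda name: cost(gpus[name]))


if __name__ == "__main__":
    gpus = {
        "gpu0": GPUState(cached_prefixes=[("system", "tool-docs", "q1")], pending_tokens=500),
        "gpu1": GPUState(cached_prefixes=[], pending_tokens=10),
    }
    # The request shares a long prefix with gpu0's cache; with a small load_weight,
    # prefix reuse outweighs gpu0's higher load and the scheduler picks gpu0.
    print(schedule(("system", "tool-docs", "q2"), gpus, load_weight=0.001))
```

Tuning load_weight moves the policy between the two extremes the abstract contrasts: a very small weight behaves like pure cache-affinity routing, while a large weight behaves like pure load balancing.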
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8281