Keywords: LLM inference, Throughput optimization, Batched inference
TL;DR: This paper introduces multi-bin batching for LLM inference, grouping requests by output length to optimize throughput without fine-grained hardware control.
Abstract: As large language models (LLMs) grow in popularity for their diverse capabilities, improving the efficiency of their inference systems has become increasingly critical. Batching requests during LLM inference increases throughput by allowing multiple requests to be processed in parallel, making better use of hardware resources such as GPUs. However, the autoregressive nature of LLMs presents a challenge: requests often have varying execution times, causing resource underutilization, as hardware must wait for the longest-running request in the batch to complete before moving to the next batch. We propose Multi-Bin Batching, a simple yet effective method that can provably improve LLM inference throughput by grouping requests with similar execution times into predetermined bins. We evaluate multi-bin batching in various settings, showing consistent throughput improvements compared to standard batching approaches.
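The core idea in the abstract — binning requests by predicted execution time so that each batch contains similarly long requests — can be sketched as follows. This is a hypothetical illustration only: the paper's actual bin boundaries, length predictor, and scheduling policy are not specified here, and the function names (`assign_bins`, `form_batches`) are made up for this sketch.

```python
from collections import defaultdict

def assign_bins(requests, bin_edges):
    """Group requests into bins by predicted output length.

    `requests` is a list of (request_id, predicted_length) pairs;
    `bin_edges` lists the upper length bound of each bin, in
    increasing order. (How lengths are predicted is out of scope.)
    """
    bins = defaultdict(list)
    for req_id, length in requests:
        # Place the request in the first bin whose upper edge covers it.
        for i, edge in enumerate(bin_edges):
            if length <= edge:
                bins[i].append(req_id)
                break
        else:
            # Longer than every edge: fall back to the last bin.
            bins[len(bin_edges) - 1].append(req_id)
    return bins

def form_batches(bins, batch_size):
    """Within each bin, cut the queue into fixed-size batches.

    Because all requests in a batch come from the same bin, their
    execution times are similar, so less time is wasted waiting for
    the longest request in the batch.
    """
    batches = []
    for i in sorted(bins):
        queue = bins[i]
        for start in range(0, len(queue), batch_size):
            batches.append(queue[start:start + batch_size])
    return batches
```

For example, with requests of predicted lengths 10, 200, 15, and 180 tokens and bin edges [64, 256], the short pair and the long pair would land in separate bins and thus separate batches, rather than a short request idling behind a long one.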
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1953