Optimizing LLM Inference: Fluid-Based Online Scheduling under Memory Constraints

Published: 28 Nov 2025, Last Modified: 30 Nov 2025, NeurIPS 2025 Workshop MLxOR, CC BY 4.0
Keywords: Large Language Model, Key-value cache, Memory Constraint, Online scheduling
TL;DR: LLM inference scheduling with KV-cache memory constraints
Abstract: Large Language Model (LLM) inference faces unique scheduling challenges because the Key-Value (KV) cache grows dynamically during token generation, making traditional scheduling algorithms ineffective. We develop a fluid dynamics approximation to establish an optimal throughput benchmark and propose the WAIT (Waiting for Accumulated Inference Threshold) algorithm, which achieves a provably near-optimal throughput gap relative to this benchmark. For practical scenarios with unknown output lengths, we introduce Nested WAIT, which maintains asymptotic optimality through hierarchical segmentation. Experiments on Llama-7B demonstrate 20-30\% throughput improvements over state-of-the-art systems such as vLLM.
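The abstract describes WAIT as holding requests until an accumulated threshold is reached before admitting them as a batch under a KV-cache memory budget. The following is a minimal illustrative sketch of that threshold-then-batch idea, not the paper's actual algorithm; the class name, the per-request token estimates, and the greedy packing rule are all assumptions for illustration.

```python
from collections import deque

class WaitStyleScheduler:
    """Hypothetical sketch of a WAIT-style policy: wait until enough
    requests have accumulated, then admit a batch that fits in the
    KV-cache memory budget. Names and packing rule are illustrative."""

    def __init__(self, threshold: int, memory_budget: int):
        self.threshold = threshold          # min queued requests before admitting a batch
        self.memory_budget = memory_budget  # total KV-cache capacity in tokens (assumed unit)
        self.queue: deque = deque()

    def arrive(self, request_id: str, est_tokens: int) -> None:
        # New request joins the queue with an estimated KV-cache footprint.
        self.queue.append((request_id, est_tokens))

    def maybe_admit(self) -> list:
        # Keep waiting until the accumulated threshold is met; batching
        # amortizes per-step overhead, which is the intuition behind WAIT.
        if len(self.queue) < self.threshold:
            return []
        batch, used = [], 0
        # Greedily pack queued requests while they fit in the memory budget.
        while self.queue and used + self.queue[0][1] <= self.memory_budget:
            rid, tokens = self.queue.popleft()
            batch.append(rid)
            used += tokens
        return batch
```

A request whose KV-cache footprint would exceed the remaining budget simply stays queued for a later batch; the paper's fluid benchmark is what makes the choice of threshold principled rather than ad hoc.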
Submission Number: 17