Keywords: Large Language Model, Key-Value Cache, Memory Constraint, Online Scheduling
TL;DR: LLM inference scheduling with KV-cache memory constraints
Abstract: Large Language Model (LLM) inference faces unique scheduling challenges because the Key-Value (KV) cache grows dynamically during token generation, rendering traditional scheduling algorithms ineffective. We develop a fluid dynamics approximation to establish an optimal throughput benchmark and propose the WAIT (Waiting for Accumulated Inference Threshold) algorithm, which achieves near-optimal throughput when output lengths are known. For practical scenarios with unknown output lengths, we introduce Nested WAIT, which maintains asymptotic optimality through hierarchical segmentation. Experiments on Llama-7B demonstrate 20-30\% throughput improvements over state-of-the-art systems such as vLLM.
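The core idea named by WAIT — holding arriving requests until an accumulated threshold is reached before launching a batch — can be sketched minimally as below. This is an illustrative assumption of the mechanism based only on the algorithm's name: the function name `wait_scheduler`, its signature, and the simple count-based threshold are hypothetical and not the paper's exact rule.

```python
import collections

def wait_scheduler(arrivals, threshold):
    # Hypothetical sketch: hold arriving requests in a queue and only
    # launch a batch once the accumulated count reaches `threshold`.
    # The real WAIT threshold (and its interaction with KV-cache
    # memory) is defined in the paper, not reproduced here.
    queue = collections.deque()
    batches = []
    for request in arrivals:
        queue.append(request)
        if len(queue) >= threshold:  # accumulated threshold reached
            batches.append([queue.popleft() for _ in range(threshold)])
    return batches, list(queue)  # launched batches, still-waiting requests
```

For example, seven requests with a threshold of 3 would launch two batches of three and leave one request waiting.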
Submission Number: 17