Efficient KV Cache Spillover Management on Memory-Constrained GPU for LLM Inference

Published: 2026 · Last Modified: 21 Jan 2026 · IEEE Trans. Parallel Distributed Syst. 2026 · CC BY-SA 4.0
Abstract: The rapid growth of model parameters presents a significant challenge when deploying large generative models on GPUs. Existing LLM runtime memory management solutions tend to maximize batch size to saturate GPU utilization. Nevertheless, this practice leads to situations where the KV Cache of certain sequences cannot be accommodated on GPUs with limited memory capacity during model inference, requiring temporary eviction from GPU memory (referred to as KV Cache spillover). However, without careful consideration of the runtime patterns of LLM inference, current memory management solutions suffer from a one-size-fits-all spillover handling approach across different platforms, under-utilization of the GPU during the prefill stage, and suboptimal sequence selection caused by directly employing swap or recomputation. In this article, we introduce FuseSpill, a holistic KV Cache management solution designed to boost LLM inference on memory-constrained GPUs by efficiently handling KV Cache spillover. Specifically, FuseSpill consists of a spillover cost model that quantitatively analyzes the system cost of spillover handling techniques, a KV Cache swap orchestrator that refines the basic swap technique into sophisticated disaggregation of the KV Cache across heterogeneous devices for decoding iterations, a multi-executor scheduler that effectively coordinates task executors across devices, and a response length predictor that enables a length-aware sequence selection strategy when KV Cache spillover occurs. The experimental results demonstrate that our implementation outperforms existing solutions, delivering a 20% to 40% increase in throughput while simultaneously reducing the inference latency of spillover sequences.
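To make the idea of a spillover cost model concrete, the minimal sketch below compares the time to swap a sequence's KV Cache over PCIe against the time to recompute it via prefill, and picks the cheaper option. This is an illustrative assumption of how such a comparison could look, not the paper's actual formulation; all names (`Sequence`, `choose_spillover_action`) and constants (PCIe bandwidth, prefill throughput) are hypothetical.

```python
# Hypothetical spillover cost model sketch: swap vs. recomputation.
# All parameters and constants are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Sequence:
    tokens_cached: int       # tokens whose KV entries currently reside on the GPU
    kv_bytes_per_token: int  # KV Cache bytes per token (layers * heads * head_dim * 2 * dtype size)


def swap_cost_s(seq: Sequence, pcie_gbps: float) -> float:
    """Time to move the sequence's KV Cache over PCIe, out of and back into GPU memory."""
    total_bytes = seq.tokens_cached * seq.kv_bytes_per_token
    return 2 * total_bytes / (pcie_gbps * 1e9)


def recompute_cost_s(seq: Sequence, prefill_tokens_per_s: float) -> float:
    """Time to rebuild the KV Cache by re-running prefill over the cached tokens."""
    return seq.tokens_cached / prefill_tokens_per_s


def choose_spillover_action(seq: Sequence,
                            pcie_gbps: float = 16.0,
                            prefill_tokens_per_s: float = 20_000.0) -> str:
    """Pick the cheaper spillover handling technique for a spilled sequence."""
    if swap_cost_s(seq, pcie_gbps) <= recompute_cost_s(seq, prefill_tokens_per_s):
        return "swap"
    return "recompute"


if __name__ == "__main__":
    # A long sequence with a large per-token KV footprint (~4 GiB total).
    seq = Sequence(tokens_cached=4096, kv_bytes_per_token=1_048_576)
    print(choose_spillover_action(seq))
```

Under these assumed numbers, short sequences on a slow interconnect tend to favor recomputation, while long sequences on fast links favor swapping; the crossover point shifts with platform characteristics, which is the motivation the abstract gives for avoiding a one-size-fits-all policy.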