IceCache: Memory-Efficient KV-cache Management for Long-Sequence LLMs

ICLR 2026 Conference Submission 541 Authors

01 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM Inference; KV-cache Optimization; Sparse Attention
Abstract: The Key-Value (KV) cache plays a pivotal role in accelerating inference in large language models (LLMs) by storing intermediate attention states, thereby avoiding redundant computation during auto-regressive generation. However, the cache's memory footprint scales linearly with sequence length, often creating memory bottlenecks on constrained hardware. While prior work has explored offloading the KV-cache to the CPU and maintaining a reduced subset on the GPU, these approaches frequently suffer from imprecise token prioritization and degraded performance on long-generation tasks such as multi-turn dialogue and chain-of-thought reasoning. In this paper, we propose IceCache, a novel KV-cache management strategy that integrates semantic token clustering with PagedAttention, a memory-efficient paging mechanism. By clustering semantically related tokens and organizing them into a hierarchical, dynamically updatable structure, our method improves cache hit rates and memory bandwidth utilization during CPU-GPU transfers. Experimental results show that IceCache retains over 99\% of the full KV-cache model's accuracy with a 256-token budget and still maintains 97\% with only a 64-token budget. It outperforms existing baselines while using just 25\% of their KV-cache token budget, demonstrating its superior accuracy in long-sequence scenarios.
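To make the clustering-plus-budget idea in the abstract concrete, the sketch below clusters cached key vectors and then selects whole clusters, in order of centroid-query similarity, until a fixed token budget is filled. This is a minimal illustration under assumed names (cluster_kv_tokens, select_tokens_under_budget are hypothetical, not the authors' API); the actual IceCache additionally organizes clusters into a hierarchical, dynamically updated structure and maps selections onto PagedAttention pages for CPU-GPU transfer, none of which is modeled here.

```python
# Hypothetical sketch of semantic KV-token clustering with a fixed token budget.
# Function and variable names are illustrative assumptions, not IceCache's real interface.
import numpy as np


def cluster_kv_tokens(keys: np.ndarray, n_clusters: int, n_iters: int = 10, seed: int = 0):
    """Naive k-means over cached key vectors for one attention head.

    keys: (num_tokens, head_dim) array of cached key vectors.
    Returns (centroids, assignments).
    """
    rng = np.random.default_rng(seed)
    centroids = keys[rng.choice(len(keys), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        # Assign each cached token to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(keys[:, None, :] - centroids[None, :, :], axis=-1)
        assignments = dists.argmin(axis=1)
        # Recompute centroids; keep the previous centroid if a cluster goes empty.
        for c in range(n_clusters):
            members = keys[assignments == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids, assignments


def select_tokens_under_budget(query, centroids, assignments, token_budget):
    """Pick whole clusters in order of centroid-query similarity until the budget is filled."""
    sims = centroids @ query                  # (n_clusters,) relevance scores
    order = np.argsort(-sims)                 # most relevant clusters first
    selected = []
    for c in order:
        members = np.flatnonzero(assignments == c)
        if len(selected) + len(members) > token_budget:
            members = members[: token_budget - len(selected)]
        selected.extend(members.tolist())
        if len(selected) >= token_budget:
            break
    return np.array(sorted(selected))         # token indices to fetch to the GPU


if __name__ == "__main__":
    num_tokens, head_dim = 4096, 64
    rng = np.random.default_rng(1)
    keys = rng.standard_normal((num_tokens, head_dim)).astype(np.float32)
    query = rng.standard_normal(head_dim).astype(np.float32)
    centroids, assignments = cluster_kv_tokens(keys, n_clusters=32)
    kept = select_tokens_under_budget(query, centroids, assignments, token_budget=256)
    print(f"Fetching {len(kept)} of {num_tokens} cached tokens under a 256-token budget")
```

Selecting clusters rather than individual tokens is what allows contiguous, page-sized transfers; how the real system scores and evicts clusters as generation proceeds is described in the paper, not here.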
Primary Area: generative models
Submission Number: 541