DISCO: Dispersion-Guided Sparse KV Cache Compression for Efficient Long-Context Inference in Large Language Models
Keywords: Large Language Model, Efficient Generative Inference, Key-Value Cache
Abstract: Large Language Models (LLMs) face severe memory bottlenecks in long-context inference because the Key-Value (KV) cache grows with sequence length. Existing approaches rely on attention heuristics and uniform layer-wise budgets, resulting in inefficient memory use and performance loss.
We propose DISCO, a dispersion-guided KV cache compression framework that exploits intrinsic layer-wise redundancy in Transformer representations. DISCO grounds redundancy in geometric dispersion, allocates layer-wise KV budgets using PCA-based Dispersion Scores to prioritize information-rich layers, and applies a lightweight token eviction strategy that preserves dispersion by favoring geometrically informative tokens. Extensive experiments demonstrate that DISCO preserves model performance while using only 3.13% of the original KV cache, significantly improving memory efficiency and throughput under fixed-budget settings. We further evaluate DISCO on a real-world medical benchmark, providing evidence that dispersion-based redundancy modeling preserves low-frequency, domain-critical information and remains effective in high-fidelity reasoning settings.
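The abstract's budget-allocation idea can be illustrated with a minimal sketch. This is not the authors' implementation: the dispersion score here is assumed to be the entropy of the PCA explained-variance spectrum of each layer's key vectors (one plausible reading of "PCA-based Dispersion Scores"), and budgets are assumed to be split proportionally to those scores.

```python
import numpy as np

def dispersion_score(keys: np.ndarray) -> float:
    # keys: (num_tokens, head_dim) key vectors cached for one layer.
    # PCA via SVD of the centered matrix; the score is the entropy of
    # the normalized explained-variance spectrum (higher = the keys
    # spread over more principal directions, i.e. less redundancy).
    centered = keys - keys.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    p = variance / variance.sum()
    p = p[p > 0]  # drop zero components before taking logs
    return float(-(p * np.log(p)).sum())

def allocate_budgets(layer_keys: list, total_budget: int) -> list:
    # Hypothetical proportional allocation: layers with higher
    # dispersion (more information-rich) receive larger KV budgets.
    scores = np.array([dispersion_score(k) for k in layer_keys])
    weights = scores / scores.sum()
    budgets = np.floor(weights * total_budget).astype(int)
    # Assign the rounding remainder to the highest-weight layer so
    # the budgets sum exactly to the fixed total.
    budgets[np.argmax(weights)] += total_budget - budgets.sum()
    return budgets.tolist()
```

Under a fixed total budget, this scheme gives uniform allocation only when all layers are equally dispersed; otherwise it shifts capacity toward information-rich layers, which is the contrast with the uniform layer-wise budgets the abstract criticizes.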
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: pruning, LLM efficiency, NLP in resource-constrained settings, efficient models, inference methods
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5797