DISCO: Dispersion-Guided Sparse KV Cache Compression for Efficient Long-Context Inference in Large Language Models
Keywords: Large Language Model, Efficient Generative Inference, Key-Value Cache
Abstract: Large Language Models (LLMs) face severe memory bottlenecks in long-context inference because the Key-Value (KV) cache grows with sequence length. Existing approaches rely on attention heuristics and uniform layer-wise budgets, resulting in inefficient memory use and performance loss.
We propose DISCO, a dispersion-guided KV cache compression framework that exploits intrinsic layer-wise redundancy in Transformer representations. DISCO grounds redundancy in geometric dispersion, allocates layer-wise KV budgets using PCA-based Dispersion Scores to prioritize information-rich layers, and applies a lightweight token eviction strategy that preserves dispersion by favoring geometrically informative tokens. Extensive experiments demonstrate that DISCO preserves model performance while using only 3.13% of the original KV cache, significantly improving memory efficiency and throughput under fixed-budget settings. We further evaluate DISCO on a real-world medical benchmark, providing evidence that dispersion-based redundancy modeling preserves low-frequency, domain-critical information and remains effective in high-fidelity reasoning settings.
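The abstract's budget-allocation idea can be illustrated with a minimal sketch. This is not the authors' implementation: the dispersion score here is assumed to be the entropy of the PCA explained-variance spectrum of each layer's key vectors (one plausible reading of "PCA-based Dispersion Scores"), and budgets are assumed to be split proportionally to those scores.

```python
import numpy as np

def dispersion_score(keys: np.ndarray) -> float:
    # keys: (num_tokens, head_dim) key vectors cached for one layer.
    # PCA via SVD of the centered matrix; the score is the entropy of
    # the normalized explained-variance spectrum (higher = the keys
    # spread over more principal directions, i.e. less redundancy).
    centered = keys - keys.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    p = variance / variance.sum()
    p = p[p > 0]  # drop zero components before taking logs
    return float(-(p * np.log(p)).sum())

def allocate_budgets(layer_keys: list, total_budget: int) -> list:
    # Hypothetical proportional allocation: layers with higher
    # dispersion (more information-rich) receive larger KV budgets.
    scores = np.array([dispersion_score(k) for k in layer_keys])
    weights = scores / scores.sum()
    budgets = np.floor(weights * total_budget).astype(int)
    # Assign the rounding remainder to the highest-weight layer so
    # the budgets sum exactly to the fixed total.
    budgets[np.argmax(weights)] += total_budget - budgets.sum()
    return budgets.tolist()
```

Under a fixed total budget, this scheme gives uniform allocation only when all layers are equally dispersed; otherwise it shifts capacity toward information-rich layers, which is the contrast with the uniform layer-wise budgets the abstract criticizes.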
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: pruning, LLM efficiency, NLP in resource-constrained settings, efficient models, inference methods
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5797