REAL: REtrieval-Augmented and Logic-constructed Attention Behaviors for Robust KV Cache Compression

18 Sept 2025 (modified: 06 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large language model, Key-value, Attention behavior
TL;DR: REAL is an attention-behavior-based KV cache compression approach that requires minimal model modification and matches FullKV accuracy.
Abstract: The growing input sequence length of large language models (LLMs) places increasing pressure on key-value (KV) cache storage, making efficient inference challenging. Existing retrieval-based compression methods neglect the impact of distracted, biased, and widespread attention behaviors, raising robustness concerns. To address these challenges, this paper proposes REtrieval-Augmented and Logic-constructed (REAL) KV cache compression, a robust, low-cost, training-free method that captures diverse attention behaviors. REAL introduces an attention weight confusion matrix (AWCM) to categorize attention behaviors and an inference score (INFsc) that balances retrieval and logic for head-wise dynamic budget allocation with an empirical per-layer safeguard. Experiments on long-sequence QA and non-QA tasks show that REAL achieves more robust compression than state-of-the-art baselines and even surpasses FullKV in certain situations. To our knowledge, REAL is the first approach to compress KV caches by attention behavior analysis, offering a new perspective.
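The abstract does not specify how INFsc scores translate into per-head cache budgets; the sketch below is a hypothetical illustration of head-wise dynamic budget allocation with a per-layer safeguard, under the assumption that each head's budget is set proportionally to its score while a minimum fraction of the budget (the safeguard) is guaranteed to every head. The function name, score semantics, and `min_frac` parameter are all illustrative, not taken from the paper.

```python
import numpy as np

def allocate_head_budgets(inf_scores, total_budget, min_frac=0.05):
    """Hypothetical head-wise KV-cache budget allocation.

    inf_scores:   per-head inference scores (higher => retain more KV entries);
                  these stand in for the paper's INFsc values (assumption).
    total_budget: total number of KV entries available to this layer.
    min_frac:     per-layer safeguard -- every head keeps at least this
                  fraction of the layer budget (illustrative choice).
    """
    scores = np.asarray(inf_scores, dtype=float)
    n_heads = scores.size
    floor = int(min_frac * total_budget)           # guaranteed minimum per head
    remaining = total_budget - n_heads * floor     # budget left to distribute
    weights = scores / scores.sum()                # proportional split by score
    budgets = floor + np.floor(weights * remaining).astype(int)
    # hand leftover entries (lost to flooring) to the highest-scoring heads
    leftover = total_budget - budgets.sum()
    for i in np.argsort(-scores)[:leftover]:
        budgets[i] += 1
    return budgets
```

For example, with scores `[0.1, 0.4, 0.5]` and a layer budget of 100 entries, every head keeps at least 5 entries and the rest are split by score, so low-scoring heads are compressed aggressively without being starved entirely.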
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 11457