Keywords: Large Language Models, KV Cache Compression
TL;DR: We propose ApertureKV, a coverage-optimizing KV cache compression method that mitigates the Echo Chamber Effect and enables accurate long-context inference under tight memory budgets.
Abstract: Large language models (LLMs) have achieved strong performance on complex tasks ranging from multi-document reasoning to long-dependency question answering. To enable efficient inference, these models rely on key-value (KV) caching, which stores and reuses KV pairs to avoid redundant computation. As the sequence length grows, the KV cache grows linearly with it, creating a severe GPU memory bottleneck. This issue is commonly addressed by compressing the KV cache via top-k selection based on attention scores. However, this strategy induces a homogeneity bias, the tendency to repeatedly select similar tokens, which creates an Echo Chamber Effect in which the compressed KV cache is dominated by redundant information. The result is low effective coverage: crucial information is lost, and answers become verbose and logically broken under constrained token budgets. To address this, we propose ApertureKV, a KV cache compression method that employs coverage-optimizing strategies to mitigate the Echo Chamber Effect. ApertureKV targets two distinct sources of redundancy with two core components: Query Diversification (QD), which adjusts queries to encourage the retention of a more diverse set of tokens, and Redundancy-Aware Budget Allocation (RABA), which allocates more budget to heads that capture distinct information. By achieving high effective coverage, ApertureKV enables robust KV cache compression under tight memory constraints, yielding more accurate responses. Evaluations on long-context benchmarks such as LongBench and LooGLE, including Needle-in-a-Haystack tasks, show that ApertureKV consistently outperforms state-of-the-art methods under tight budgets. In particular, on one LongBench sub-task with Mistral-7B-Instruct, ApertureKV retains 92.6% of FullKV performance while using only 0.2% of the KV cache budget.
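For context, the sketch below illustrates the baseline behavior the abstract describes: per-head top-k KV retention based on accumulated attention scores, plus an entropy-based heuristic for splitting a total token budget across heads as a stand-in for coverage-aware allocation. This is not ApertureKV's QD or RABA; the function names, the entropy proxy for redundancy, and all shapes are assumptions made for illustration only.

```python
# Minimal sketch (assumed, not the authors' implementation).
# - topk_kv_selection: the standard attention-score top-k eviction criterion,
#   which can repeatedly keep near-duplicate tokens (the redundancy the paper targets).
# - entropy_based_budgets: a hypothetical heuristic giving heads with more
#   spread-out attention a larger share of the total token budget.
import torch


def topk_kv_selection(attn_scores: torch.Tensor, budget: int) -> torch.Tensor:
    """attn_scores: [num_heads, seq_len] accumulated attention per cached token.
    Returns the indices of the `budget` highest-scoring tokens per head."""
    return attn_scores.topk(budget, dim=-1).indices  # [num_heads, budget]


def entropy_based_budgets(attn_scores: torch.Tensor, total_budget: int) -> torch.Tensor:
    """Assumed heuristic: heads whose attention distribution has higher entropy
    (less concentrated on a few similar tokens) receive more of the budget."""
    probs = torch.softmax(attn_scores, dim=-1)                    # [H, T]
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)  # [H]
    weights = entropy / entropy.sum()
    return (weights * total_budget).round().long().clamp(min=1)   # [H]


if __name__ == "__main__":
    torch.manual_seed(0)
    num_heads, seq_len, total_budget = 8, 1024, 256
    scores = torch.rand(num_heads, seq_len)          # stand-in attention statistics
    per_head = entropy_based_budgets(scores, total_budget)
    kept = [topk_kv_selection(scores[h:h + 1], int(b)) for h, b in enumerate(per_head)]
    print(per_head.tolist(), [k.shape for k in kept])
```

Under a uniform budget, every head would apply the same top-k rule; the per-head split here only gestures at why redundancy-aware allocation can matter when the total budget is tight.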
Supplementary Material: zip
Primary Area: generative models
Submission Number: 16403