AUV: Efficient KV Cache Eviction for LLMs via Attention Score Aggregation and Usage Count

Published: 31 Oct 2025 · Last Modified: 02 Feb 2026 · ICONIP 2025 · CC BY 4.0
Abstract: As transformer-based large language models (LLMs) evolve, the need for efficient inference, particularly for managing the KV cache during the decoding phase, has intensified. Growing sequence lengths lead to substantial memory usage, which hinders practical deployment. Existing cache eviction strategies, primarily based on accumulated attention scores, tend to prioritize early tokens, causing unstable performance, and they neglect to integrate key metrics for dynamic cache management. To address these challenges, we propose AUV, an innovative cache eviction framework. AUV introduces a novel attention score aggregation method that mitigates uneven eviction, combining two complementary metrics, Total Attention Level and Strong Attention Frequency, by integrating the aggregated attention score with the usage count. Furthermore, AUV adopts a multi-step eviction strategy with an eviction compensation mechanism to optimize both efficiency and accuracy, and it improves KV cache management to avoid fragmentation. Through extensive experiments with OPT models, we demonstrate that AUV outperforms existing methods, including H2O, NACL, and the full cache, under low cache budgets, achieving greater accuracy while maintaining throughput. Notably, AUV achieves a 10-percentage-point improvement over H2O in BERTScore-F1 while retaining only a 2% KV cache budget. These results highlight the potential of AUV to reduce memory consumption without sacrificing performance.
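The core idea of combining a Total Attention Level with a Strong Attention Frequency (usage count) can be sketched as below. This is a minimal illustrative sketch, not the paper's exact formulation: the threshold `tau`, the min-max normalization, and the additive fusion of the two metrics are all assumptions, as is the function naming.

```python
import numpy as np

def auv_style_scores(attn, tau=0.1):
    """Hypothetical sketch of AUV-style token scoring.

    attn: array of shape (num_steps, seq_len) holding attention weights
    that each cached token received over recent decode steps.

    Total Attention Level: the attention mass a token accumulated.
    Strong Attention Frequency: a usage count of how often a token's
    attention exceeded the threshold tau (tau is an assumed parameter).
    """
    total_attention = attn.sum(axis=0)       # Total Attention Level
    usage_count = (attn > tau).sum(axis=0)   # Strong Attention Frequency
    # Normalize each metric to [0, 1] before fusing (assumed fusion rule).
    ta = total_attention / (total_attention.max() + 1e-9)
    uc = usage_count / (usage_count.max() + 1e-9)
    return ta + uc

def keep_under_budget(attn, budget, tau=0.1):
    """Return sorted indices of the `budget` tokens to keep in the KV cache;
    all other tokens would be evicted."""
    scores = auv_style_scores(attn, tau)
    keep = np.argsort(scores)[-budget:]
    return np.sort(keep)
```

Under this sketch, a token that is rarely attended to, and never strongly, scores low on both metrics and is evicted first, while early tokens cannot dominate purely by accumulating small amounts of attention over many steps.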