VAM: Value-Attention Merging for KV Cache Optimization in LLMs

ICLR 2026 Conference Submission 16725 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Model, KV Cache, Self Attention
TL;DR: VAM preserves context semantics in KV cache for better long-text inference, complementing compression methods and boosting LLM performance.
Abstract: Efficient key-value (KV) cache management is essential for large language models (LLMs) performing long-text inference. Traditional methods, which retain all original KV pairs, lead to high memory usage and degraded performance due to outdated contextual representations. While existing solutions predominantly focus on cache eviction or compression to reduce memory and computation, they largely neglect the issue of semantic degradation in the cache itself. In this paper, we identify two critical limitations in long-context inference, Progressive Clustering and Context Degradation, which cause the model to lose global contextual awareness over time. To address these issues, we propose VAM, a plug-and-play KV cache optimization algorithm that dynamically merges attention outputs into value states. Unlike cache compression methods that aim to reduce cache size, VAM specifically targets the preservation of contextual semantics in the cached representations, thereby improving the model's ability to retain and utilize long-range dependencies. VAM is lightweight, easy to integrate, and complementary to existing compression strategies. Experiments on LongBench tasks across LLaMA and Mistral models (7B–70B) show consistent improvements of 0.36–6.45 in absolute score (0.64%–4.26% relative), and up to 8.33% when combined with state-of-the-art KV compression methods, demonstrating VAM's effectiveness in enhancing long-sequence inference quality. Our code is available at https://anonymous.4open.science/r/vam-torch-386B/.
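The abstract describes VAM only at a high level ("dynamically merges attention outputs into value states"), so the PyTorch snippet below is a minimal hypothetical sketch of what such a merge could look like, not the authors' implementation. The function name, the attention-weighted mixing rule, and the `alpha` coefficient are all assumptions for illustration; the actual algorithm is in the linked repository.

```python
import torch


def merge_attention_into_values(values, attn_weights, attn_output, alpha=0.1):
    """Hypothetical sketch of merging attention outputs into cached value states.

    values:       (batch, heads, seq_len, head_dim)  cached value states
    attn_weights: (batch, heads, q_len, seq_len)     attention probabilities of the new queries
    attn_output:  (batch, heads, q_len, head_dim)    attention output of the new queries

    Each cached value is nudged toward the current attention output, weighted by
    how much attention that cached position received (an illustrative choice).
    """
    # Mean attention each cached position received from the new queries: (b, h, s, 1)
    received = attn_weights.mean(dim=2).unsqueeze(-1)
    # Mean attention output, broadcast over cached positions: (b, h, 1, d)
    context = attn_output.mean(dim=2, keepdim=True)
    # Interpolate: heavily attended positions move further toward the context vector.
    return values + alpha * received * (context - values)


if __name__ == "__main__":
    b, h, q, s, d = 1, 8, 4, 128, 64
    V = torch.randn(b, h, s, d)
    A = torch.softmax(torch.randn(b, h, q, s), dim=-1)
    O = A @ V  # standard attention output for the new queries
    V_updated = merge_attention_into_values(V, A, O)
    print(V_updated.shape)  # torch.Size([1, 8, 128, 64])
```

Because the update touches only the cached value tensors after attention is computed, a scheme of this kind would be plug-and-play in the sense the abstract claims: no model weights change, and it can sit alongside eviction or compression of the same cache.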
Primary Area: generative models
Submission Number: 16725