LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation

ACL ARR 2025 February Submission 3379 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: The KV cache is commonly used to accelerate LLM inference with long contexts, yet its high memory demand drives the need for cache compression. Existing compression methods, however, are largely heuristic and lack dynamic budget allocation. To address this limitation, we introduce a principled framework for cache compression that minimizes information loss in Transformer residual streams. Building on this framework, we analyze the layer attention output loss and derive a new metric for comparing cache entries across heads, enabling layer-wise compression with dynamic head budgets. By further contrasting information across layers, we also achieve dynamic layer budgets. Our method, named LAVa, is theoretically grounded and simple, requiring no parameter tuning. Experiments on the LongBench and Needle-in-a-Haystack benchmarks demonstrate its superiority over strong baselines. Notably, we find that dynamic layer budgets are crucial for generation tasks (e.g., code completion), whereas dynamic head budgets are important for extraction tasks (e.g., extractive QA). As a fully dynamic compression method, LAVa consistently maintains top performance across task types and LLM architectures.
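To make the abstract's idea of layer-wise eviction with dynamic head budgets concrete, below is a minimal illustrative sketch, not the paper's implementation. The function name `layerwise_evict`, the tensor shapes, and the scoring rule (recent attention mass scaled by the value-vector norm) are all assumptions standing in for the head-comparable metric that LAVa derives from the layer attention-output loss; only the mechanism of ranking cache entries across heads under a shared layer budget is taken from the abstract.

```python
# Hypothetical sketch: rank all (head, token) cache entries of one layer under a
# single layer budget, so per-head budgets emerge dynamically rather than being fixed.
import torch

def layerwise_evict(keys, values, attn_weights, layer_budget):
    """
    keys, values:  [num_heads, seq_len, head_dim]  cached K/V for one layer
    attn_weights:  [num_heads, seq_len]            recent attention mass per cached entry
    layer_budget:  int, total number of (head, token) entries to keep in this layer

    Returns a boolean keep-mask of shape [num_heads, seq_len].
    The score below is a placeholder; the paper derives its own metric from the
    layer attention-output loss.
    """
    num_heads, seq_len, _ = keys.shape
    scores = attn_weights * values.norm(dim=-1)          # [num_heads, seq_len]

    flat = scores.flatten()                              # compare entries across heads
    keep_idx = flat.topk(min(layer_budget, flat.numel())).indices
    keep = torch.zeros_like(flat, dtype=torch.bool)
    keep[keep_idx] = True
    return keep.view(num_heads, seq_len)                 # per-head budgets are now dynamic
```

Because all heads of a layer compete for one budget, heads whose entries score uniformly low give up cache slots to heads with sharply important entries; a dynamic layer budget would apply the same idea one level up, across layers.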
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: NLP in resource-constrained settings, efficient models
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches low compute settings-efficiency, Theory
Languages Studied: English
Submission Number: 3379