Abstract: KV cache is commonly used to accelerate LLM inference with long contexts, yet its high memory demand drives the need for cache compression. Existing compression methods, however, are largely heuristic and lack dynamic budget allocation. To address this limitation, we introduce a principled framework for cache compression that minimizes information loss in Transformer residual streams. Building on this framework, we analyze the loss in each layer's attention output and derive a new metric for comparing cache entries across heads, enabling layer-wise compression with dynamic head budgets. Additionally, by contrasting cross-layer information, we also achieve dynamic layer budgets. Our method (named LAVa) is theoretically grounded and simple, requiring no parameter tuning. Experiments on the LongBench and Needle-in-a-Haystack benchmarks demonstrate its superiority over strong baselines. Notably, we find that dynamic layer budgets are crucial for generation tasks (e.g., code completion), whereas dynamic head budgets are important for extraction tasks (e.g., extractive QA). As a fully dynamic compression method, LAVa consistently maintains top performance across task types and LLM architectures.
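To make the abstract's idea of "layer-wise compression with dynamic head budgets" concrete, the sketch below shows how a single layer-wide budget can be spent across heads by ranking all cache entries on one comparable score. The scoring function here (attention weight times value-vector norm) is an illustrative stand-in, not the paper's derived metric, and all names are hypothetical.

```python
import numpy as np

def layer_evict(attn, values, layer_budget):
    """Sketch of layer-wise KV cache eviction with a shared layer budget.

    attn: (heads, seq) pooled attention weights for each cached entry.
    values: (heads, seq, d) cached value vectors.
    layer_budget: total number of cache entries to keep in this layer.
    Returns a boolean keep-mask of shape (heads, seq).
    """
    heads, seq = attn.shape
    # Score each (head, position) entry with a head-comparable proxy for its
    # contribution to the layer's attention output (stand-in metric).
    scores = attn * np.linalg.norm(values, axis=-1)   # (heads, seq)
    flat = scores.ravel()
    keep = np.zeros_like(flat, dtype=bool)
    keep[np.argsort(flat)[-layer_budget:]] = True     # global top-k in layer
    # Reshaping back yields per-head budgets that emerge dynamically:
    # heads with higher-scoring entries retain more of the layer budget.
    return keep.reshape(heads, seq)

rng = np.random.default_rng(0)
mask = layer_evict(rng.random((4, 32)), rng.standard_normal((4, 32, 8)), 64)
print(int(mask.sum()))  # total kept entries equals the layer budget: 64
```

Because entries from all heads compete for one pool, a head whose entries score uniformly low may keep far fewer than seq * layer_budget / (heads * seq) entries, which is exactly the dynamic-head-budget behavior the abstract describes.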
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: NLP in resource-constrained settings, efficient models
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Theory
Languages Studied: English
Submission Number: 3379