DiliLazyKV: Diligent–Lazy Head Effect on Robust KV Cache Compression

ACL ARR 2025 May Submission 5488 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: The increasing length of context windows in Large Language Models (LLMs) puts significant pressure on key-value (KV) cache storage, making efficient inference more challenging. Existing compression techniques, which operate at the token, layer, or head level, often risk discarding valuable information and lack comprehensive adaptability. To overcome these limitations, this paper introduces DiliLazyKV, a novel two-stage approach that exploits finer-grained functional adaptability through a proposed Inference Score computed at the head-layer level. DiliLazyKV achieves greater compression while maintaining better performance across diverse tasks and longer contexts, reaching $\beta$=1.351 in low-resource settings (KV cache sizes of 64 and 128), and thus provides a robust KV cache compression strategy. The code is available at https://github.com/DiliLazyKV/DiliLazyKV.
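To make the head-layer-level idea concrete, the sketch below illustrates one generic way a per-head importance score could drive KV cache budgets and per-head token eviction. It is only an assumption-laden illustration: the paper's actual Inference Score and two-stage procedure are not specified in this abstract, and the function names (`allocate_head_budgets`, `compress_kv_per_head`) and the accumulated-attention proxy are hypothetical stand-ins, not the authors' method.

```python
# Illustrative sketch only: the real DiliLazyKV Inference Score and two-stage
# pipeline are not given in the abstract; scoring and allocation here are stand-ins.
import torch


def allocate_head_budgets(head_scores: torch.Tensor, total_budget: int, floor: int = 4) -> torch.Tensor:
    """Split a per-layer KV budget across heads in proportion to a per-head score.

    head_scores: (num_layers, num_heads) hypothetical importance scores.
    total_budget: KV entries to keep per layer (e.g. 64 or 128, as in the abstract).
    floor: minimum entries every head keeps ("lazy" heads are trimmed, not erased).
    """
    num_layers, num_heads = head_scores.shape
    budgets = torch.full((num_layers, num_heads), floor, dtype=torch.long)
    spare = total_budget - floor * num_heads
    if spare > 0:
        weights = head_scores / head_scores.sum(dim=-1, keepdim=True)
        budgets += (weights * spare).round().long()
    return budgets


def compress_kv_per_head(keys, values, attn_weights, budget: int):
    """Keep the `budget` most-attended cached positions for a single head.

    keys, values: (seq_len, head_dim); attn_weights: (seq_len,) accumulated
    attention received by each cached position (a common importance proxy).
    """
    budget = min(budget, keys.shape[0])
    keep = attn_weights.topk(budget).indices.sort().values  # preserve token order
    return keys[keep], values[keep]


if __name__ == "__main__":
    torch.manual_seed(0)
    scores = torch.rand(32, 32)                         # e.g. 32 layers x 32 heads
    budgets = allocate_head_budgets(scores, total_budget=128)
    k, v, w = torch.randn(1024, 128), torch.randn(1024, 128), torch.rand(1024)
    k_small, v_small = compress_kv_per_head(k, v, w, int(budgets[0, 0]))
    print(budgets.shape, k_small.shape)                 # budgets per head, compressed KV
```

Proportional allocation like this gives "diligent" heads more cache than "lazy" ones while never dropping a head entirely; whether DiliLazyKV allocates this way is not stated in the abstract.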
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Large Language Model, Key-Value Cache, Attention Mechanism, Attention Head
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Keywords: Large Language Model, Key-Value Cache, Attention Mechamism, Needle-in-a-Haystack
Submission Number: 5488