DiliLazyKV: Diligent–Lazy Head Effect on Robust KV Cache Compression

ACL ARR 2025 May Submission 5488 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: The increasing length of context windows in Large Language Models (LLMs) puts significant pressure on key-value (KV) cache storage, making efficient inference more challenging. Existing compression techniques, which operate at the token, layer, or head level, often risk discarding valuable information and lack comprehensive adaptability. To overcome these limitations, this paper introduces DiliLazyKV, a novel two-stage approach that exploits finer-grained functional adaptability through a proposed Inference Score computed at the head-layer level. DiliLazyKV achieves greater compression while maintaining better performance across diverse tasks and longer contexts, reaching $\beta$=1.351 in low-resource settings (KV cache sizes of 64 and 128), and thus provides a robust KV cache compression strategy. The code is available at https://github.com/DiliLazyKV/DiliLazyKV.
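To make the head-layer-level idea concrete, the sketch below illustrates one generic way a per-head importance score could drive KV cache budgets and per-head token eviction. It is only an assumption-laden illustration: the paper's actual Inference Score and two-stage procedure are not specified in this abstract, and the function names (`allocate_head_budgets`, `compress_kv_per_head`) and the accumulated-attention proxy are hypothetical stand-ins, not the authors' method.

```python
# Illustrative sketch only: the real DiliLazyKV Inference Score and two-stage
# pipeline are not given in the abstract; scoring and allocation here are stand-ins.
import torch


def allocate_head_budgets(head_scores: torch.Tensor, total_budget: int, floor: int = 4) -> torch.Tensor:
    """Split a per-layer KV budget across heads in proportion to a per-head score.

    head_scores: (num_layers, num_heads) hypothetical importance scores.
    total_budget: KV entries to keep per layer (e.g. 64 or 128, as in the abstract).
    floor: minimum entries every head keeps ("lazy" heads are trimmed, not erased).
    """
    num_layers, num_heads = head_scores.shape
    budgets = torch.full((num_layers, num_heads), floor, dtype=torch.long)
    spare = total_budget - floor * num_heads
    if spare > 0:
        weights = head_scores / head_scores.sum(dim=-1, keepdim=True)
        budgets += (weights * spare).round().long()
    return budgets


def compress_kv_per_head(keys, values, attn_weights, budget: int):
    """Keep the `budget` most-attended cached positions for a single head.

    keys, values: (seq_len, head_dim); attn_weights: (seq_len,) accumulated
    attention received by each cached position (a common importance proxy).
    """
    budget = min(budget, keys.shape[0])
    keep = attn_weights.topk(budget).indices.sort().values  # preserve token order
    return keys[keep], values[keep]


if __name__ == "__main__":
    torch.manual_seed(0)
    scores = torch.rand(32, 32)                         # e.g. 32 layers x 32 heads
    budgets = allocate_head_budgets(scores, total_budget=128)
    k, v, w = torch.randn(1024, 128), torch.randn(1024, 128), torch.rand(1024)
    k_small, v_small = compress_kv_per_head(k, v, w, int(budgets[0, 0]))
    print(budgets.shape, k_small.shape)                 # budgets per head, compressed KV
```

Proportional allocation like this gives "diligent" heads more cache than "lazy" ones while never dropping a head entirely; whether DiliLazyKV allocates this way is not stated in the abstract.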
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Large Language Model, Key-Value Cache, Attention Mechanism, Attention Head
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Keywords: Large Language Model, Key-Value Cache, Attention Mechamism, Needle-in-a-Haystack
Submission Number: 5488