HYBRIDKV: Exploiting Head-Dominant Reconstruction for Efficient Query-Agnostic KV Cache Compression

Published: 01 Jun 2026, Last Modified: 11 Jun 2026
Venue: AdaptFM Poster
Readers: Everyone
License: CC BY 4.0
Keywords: KV Cache Compression, Efficient Decoding
Abstract: Efficient key–value (KV) cache compression is crucial for large language models with long contexts. While context-reconstruction attention enables query-agnostic KV compression, its practical use is limited by large compression overhead, i.e., additional prefill-time computation required for reconstruction-based importance scoring beyond standard prefill. We show that reconstruction-based KV importance consistently concentrates on a subset of attention heads, largely independent of the input context. Based on this observation, we propose a hybrid KV cache compression method that combines context-independent head pre-pruning with token-level reconstruction-based pruning. By restricting expensive reconstruction scoring to selected heads, our method significantly reduces compression overhead. Experiments on long-context benchmarks demonstrate up to a 36% overhead reduction while largely preserving inference accuracy.
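The two-stage idea in the abstract (pick heads once, independent of context, then score tokens only within those heads) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. Every name here (`select_heads`, `token_scores`, `hybrid_compress`, `head_importance`, `probes`) is hypothetical. The token score is a stand-in (mean softmax attention from a few probe queries), not the paper's reconstruction-based score, and dropping the KV entries of unselected heads entirely is an assumption the abstract does not confirm.

```python
import numpy as np


def select_heads(head_importance, keep_heads):
    """Return indices of the keep_heads highest-scoring heads, in ascending order.

    head_importance is assumed to be precomputed offline (context-independent).
    """
    order = np.argsort(-head_importance, kind="stable")
    return np.sort(order[:keep_heads])


def token_scores(keys, probes):
    """Stand-in token importance for one head (NOT the paper's scoring).

    keys: (T, D), probes: (P, D). Returns (T,): the mean softmax attention
    weight each key receives from the probe queries.
    """
    logits = probes @ keys.T / np.sqrt(keys.shape[-1])  # (P, T)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights.mean(axis=0)


def hybrid_compress(keys, values, head_importance, probes, keep_heads, keep_tokens):
    """keys, values: (H, T, D); probes: (H, P, D).

    Returns {head: (kept_token_indices, kept_keys, kept_values)}.
    Assumption: unselected heads are dropped and never scored.
    """
    kept = {}
    for h in select_heads(head_importance, keep_heads):
        scores = token_scores(keys[h], probes[h])  # scoring only on selected heads
        top = np.argsort(-scores, kind="stable")[:keep_tokens]
        idx = np.sort(top)  # keep original token order
        kept[int(h)] = (idx, keys[h, idx], values[h, idx])
    return kept
```

In this sketch the per-token scoring function runs `keep_heads` times instead of once per head, which is where the saving would come from under the stated assumptions.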
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 56