ProtoKV: A Hybrid Semantic Prototype-based Framework for Efficient KV Cache Compression

ACL ARR 2025 May Submission934 Authors

16 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Key-Value (KV) caching accelerates LLM inference but incurs high memory overhead. Existing methods address this by preserving only critical KV pairs for inference. While clustering-based strategies excel at preserving critical KV pairs with semantic coherence, they suffer from computational inefficiency and limited parallelization. In this paper, we identify a dichotomy in token representations: while most tokens exhibit semantic similarity to their surrounding tokens, a distinct subset deviating from this pattern exhibits clustered semantic embeddings in the latent space. Leveraging this, we propose ProtoKV, a novel KV cache compression framework that combines chunk-aware local aggregation and LSH-driven global consolidation to construct hybrid semantic prototypes. These prototypes guide head-wise attention redistribution via cluster-aware pooling, efficiently retaining critical KV pairs. Experiments on LongBench show that ProtoKV achieves 2.11\% higher accuracy than the state of the art under identical memory constraints; on the Needle-In-A-Haystack task, it achieves 96.8\% retrieval accuracy at 1.6\% cache retention. Furthermore, ProtoKV reduces inference latency by up to $3.9\times$ compared with clustering-based strategies.
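The LSH-driven consolidation step mentioned in the abstract can be illustrated with a minimal random-hyperplane sketch: tokens whose embeddings fall on the same side of a set of random hyperplanes share a bucket, and each bucket is mean-pooled into a prototype. The function name `lsh_prototypes`, the plane count, and the pooling choice here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def lsh_prototypes(embeddings: np.ndarray, n_planes: int = 4, seed: int = 0):
    """Bucket token embeddings via random-hyperplane LSH, then mean-pool
    each bucket into a prototype vector. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    # One random hyperplane per hash bit.
    planes = rng.standard_normal((embeddings.shape[1], n_planes))
    # Sign pattern against each hyperplane -> integer bucket code.
    codes = (embeddings @ planes > 0) @ (1 << np.arange(n_planes))
    prototypes = {
        int(code): embeddings[codes == code].mean(axis=0)
        for code in np.unique(codes)
    }
    return codes, prototypes

# Toy usage: 128 "token embeddings" of dimension 64.
tokens = np.random.default_rng(1).standard_normal((128, 64))
codes, protos = lsh_prototypes(tokens)
print(len(protos))  # number of prototypes, at most 2**n_planes
```

Because nearby embeddings tend to share sign patterns, semantically similar tokens collapse into the same prototype, which is what makes an LSH pass cheaper and more parallel than iterative clustering.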
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: NLP in resource-constrained settings
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Keywords: NLP in resource-constrained settings
Submission Number: 934