Abstract: Managing the extensive Key-Value (KV) cache is critical for efficient long-context processing in Large Language Models (LLMs). Conventional channel pruning techniques for the KV cache typically assess each channel in isolation, neglecting the interdependencies among channels. Accordingly, we introduce an $\textbf{I}$nterdependence-$\textbf{A}$ware KV Cache $\textbf{P}$runing (IAP) method, moving beyond the conventional paradigm of isolated channel scoring. Specifically, we first verify the existence of inter-channel interactions, then reformulate the channel selection objective to include a channel-interdependence component, and propose a graph-based algorithm to identify channels for pruning. Furthermore, IAP mitigates the challenge of query distribution shifts during decoding by strategically retaining high-magnitude key channels. Extensive experiments on LongBench with LLaMA and Mistral models demonstrate that IAP achieves marked improvements in preserving model performance post-pruning compared to established baselines, offering a more robust approach to KV cache compression.
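To make the abstract's pruning idea concrete, here is a minimal sketch of interdependence-aware channel selection. All names, tensor shapes, the covariance-based interdependence proxy, the greedy graph selection, and the quantile threshold for high-magnitude key channels are illustrative assumptions, not the paper's actual algorithm.

```python
# Illustrative sketch only: scores key-cache channels with a per-channel term
# plus a pairwise interdependence term, then greedily selects channels to keep,
# always retaining the highest-magnitude key channels.
import torch


def select_channels_to_keep(keys: torch.Tensor, queries: torch.Tensor,
                            keep_ratio: float = 0.5,
                            magnitude_floor: float = 0.9) -> torch.Tensor:
    """Return a boolean mask over channels of one attention head.

    keys, queries: (num_tokens, head_dim) tensors (hypothetical interface).
    """
    head_dim = keys.shape[-1]
    num_keep = max(1, int(keep_ratio * head_dim))

    # Per-channel importance: a simple magnitude-style proxy for how much each
    # channel contributes to q·k scores (the paper's criterion may differ).
    solo_score = queries.abs().mean(0) * keys.abs().mean(0)

    # Pairwise interdependence: absolute channel-channel covariance of the keys,
    # treated as edge weights of a channel "graph".
    centered = keys - keys.mean(0, keepdim=True)
    interdependence = (centered.T @ centered).abs() / keys.shape[0]
    interdependence.fill_diagonal_(0)

    # Greedy selection on the graph: repeatedly keep the channel whose solo
    # score plus affinity to already-kept channels is largest.
    keep = torch.zeros(head_dim, dtype=torch.bool)
    keep[solo_score.argmax()] = True
    while keep.sum() < num_keep:
        combined = solo_score + interdependence[:, keep].sum(-1)
        combined[keep] = float("-inf")  # never re-pick a kept channel
        keep[combined.argmax()] = True

    # Guard against query-distribution shift at decoding time by also keeping
    # the highest-magnitude key channels (threshold chosen arbitrarily here).
    key_norms = keys.norm(dim=0)
    keep |= key_norms >= torch.quantile(key_norms, magnitude_floor)
    return keep
```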
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Keywords: Low-Resource Methods; LLM; Model Compression
Submission Number: 6099