Abstract: Managing the extensive Key-Value (KV) cache is critical for efficient long-context processing in Large Language Models (LLMs). Conventional channel pruning techniques for the KV cache typically assess each channel in isolation, neglecting the interdependencies among channels. Accordingly, we introduce an $\textbf{I}$nterdependence-$\textbf{A}$ware KV Cache $\textbf{P}$runing (IAP) method, moving beyond the conventional paradigm of isolated channel scoring. Specifically, we first verify the existence of inter-channel interactions, then reformulate the channel selection objective to include a channel-interdependence component, and propose a graph-based algorithm to identify channels for pruning. Furthermore, IAP mitigates the challenge of query distribution shifts during decoding by strategically retaining high-magnitude key channels. Extensive experiments on LongBench with LLaMA and Mistral models demonstrate that IAP achieves marked improvements in preserving model performance post-pruning compared to established baselines, offering a more robust approach to KV cache compression.
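To make the abstract's pruning idea concrete, here is a minimal sketch of interdependence-aware channel selection. All names, tensor shapes, the covariance-based interdependence proxy, the greedy graph selection, and the quantile threshold for high-magnitude key channels are illustrative assumptions, not the paper's actual algorithm.

```python
# Illustrative sketch only: scores key-cache channels with a per-channel term
# plus a pairwise interdependence term, then greedily selects channels to keep,
# always retaining the highest-magnitude key channels.
import torch


def select_channels_to_keep(keys: torch.Tensor, queries: torch.Tensor,
                            keep_ratio: float = 0.5,
                            magnitude_floor: float = 0.9) -> torch.Tensor:
    """Return a boolean mask over channels of one attention head.

    keys, queries: (num_tokens, head_dim) tensors (hypothetical interface).
    """
    head_dim = keys.shape[-1]
    num_keep = max(1, int(keep_ratio * head_dim))

    # Per-channel importance: a simple magnitude-style proxy for how much each
    # channel contributes to q·k scores (the paper's criterion may differ).
    solo_score = queries.abs().mean(0) * keys.abs().mean(0)

    # Pairwise interdependence: absolute channel-channel covariance of the keys,
    # treated as edge weights of a channel "graph".
    centered = keys - keys.mean(0, keepdim=True)
    interdependence = (centered.T @ centered).abs() / keys.shape[0]
    interdependence.fill_diagonal_(0)

    # Greedy selection on the graph: repeatedly keep the channel whose solo
    # score plus affinity to already-kept channels is largest.
    keep = torch.zeros(head_dim, dtype=torch.bool)
    keep[solo_score.argmax()] = True
    while keep.sum() < num_keep:
        combined = solo_score + interdependence[:, keep].sum(-1)
        combined[keep] = float("-inf")  # never re-pick a kept channel
        keep[combined.argmax()] = True

    # Guard against query-distribution shift at decoding time by also keeping
    # the highest-magnitude key channels (threshold chosen arbitrarily here).
    key_norms = keys.norm(dim=0)
    keep |= key_norms >= torch.quantile(key_norms, magnitude_floor)
    return keep
```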
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Keywords: Low-Resource Methods; LLM; Model Compression
Submission Number: 6099