Keywords: Large Language Models, KV Cache Eviction, LLM Inference Acceleration
TL;DR: We propose a projection-based scoring function for KV cache eviction to accelerate LLM inference.
Abstract: Key-Value (KV) cache eviction---which retains the KV pairs of the most important tokens while discarding less important ones---is a critical technique for optimizing both memory usage and inference latency in large language models (LLMs).
However, existing approaches often rely on simple heuristics---such as attention weights---to measure token importance, overlooking the spatial relationships between token value states in the vector space.
This often leads to suboptimal token selections and thus performance degradation.
To tackle this problem, we propose a novel method, namely **AnDPro** (**An**chor **D**irection **Pro**jection), which introduces a projection-based scoring function to more accurately measure token importance.
Specifically, AnDPro operates in the space of value vectors and leverages the projections of these vectors onto an *"Anchor Direction"*---the direction of the pre-eviction output---to measure token importance and guide more accurate token selection.
Experiments on $16$ datasets from the LongBench benchmark demonstrate that AnDPro maintains $96.07\%$ of the full-cache accuracy using only a $3.44\%$ KV cache budget, reducing the KV cache budget by $46.0\%$ without compromising quality relative to previous state-of-the-art methods.
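To make the scoring idea concrete, below is a minimal sketch of a projection-based eviction step for a single attention head, assuming the anchor direction is the normalized pre-eviction attention output and that each token is scored by projecting its attention-weighted value contribution onto that direction. The function names (`andpro_scores`, `evict`) and the use of attention-weighted contributions rather than raw value vectors are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def andpro_scores(values: torch.Tensor, attn_weights: torch.Tensor) -> torch.Tensor:
    """Score cached tokens by projecting their value contributions onto the
    direction of the pre-eviction attention output (the "anchor direction").

    values:       (seq_len, d) value states of the cached tokens for one head
    attn_weights: (seq_len,)   attention weights of the current query over the cache
    """
    # Pre-eviction output: attention-weighted sum of value vectors.
    output = attn_weights @ values                           # (d,)
    # Unit vector along the pre-eviction output: the anchor direction.
    anchor = output / output.norm()                          # (d,)
    # Importance = each token's weighted contribution projected onto the anchor.
    # (Whether the score uses weighted or raw values is an assumption here.)
    return (attn_weights.unsqueeze(-1) * values) @ anchor    # (seq_len,)

def evict(values: torch.Tensor, attn_weights: torch.Tensor, budget: int) -> torch.Tensor:
    """Return indices of the `budget` highest-scoring tokens to retain."""
    scores = andpro_scores(values, attn_weights)
    keep = torch.topk(scores, k=budget).indices
    return keep.sort().values  # preserve positional order of the kept tokens
```

Intuitively, tokens whose contributions point along the pre-eviction output are the ones that shape it most, so dropping low-projection tokens perturbs the output less than dropping tokens by attention weight alone.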
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 27280