Keywords: Large Language Models, KV Cache Eviction, LLM Inference Acceleration
TL;DR: We propose a projection-based scoring function for KV cache eviction to accelerate LLM inference.
Abstract: Key-Value (KV) cache eviction---which retains the KV pairs of the most important tokens while discarding less important ones---is a critical technique for optimizing both memory usage and inference latency in large language models (LLMs).
However, existing approaches often rely on simple heuristics---such as attention weights---to measure token importance, overlooking the spatial relationships between token value states in the vector space.
This often leads to suboptimal token selections and thus performance degradation.
To tackle this problem, we propose a novel method, namely **AnDPro** (**An**chor **D**irection **Pro**jection), which introduces a projection-based scoring function to more accurately measure token importance.
Specifically, AnDPro operates in the space of value vectors and leverages the projections of these vectors onto an *"Anchor Direction"*---the direction of the pre-eviction output---to measure token importance and guide more accurate token selection.
Experiments on $16$ datasets from the LongBench benchmark demonstrate that AnDPro maintains $96.07\%$ of the full-cache accuracy using only a $3.44\%$ KV cache budget, reducing the KV cache budget by $46.0\%$ without compromising quality relative to previous state-of-the-art methods.
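To make the scoring idea concrete, below is a minimal sketch of a projection-based eviction step for a single attention head, assuming the anchor direction is the normalized pre-eviction attention output and that each token is scored by projecting its attention-weighted value contribution onto that direction. The function names (`andpro_scores`, `evict`) and the use of attention-weighted contributions rather than raw value vectors are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def andpro_scores(values: torch.Tensor, attn_weights: torch.Tensor) -> torch.Tensor:
    """Score cached tokens by projecting their value contributions onto the
    direction of the pre-eviction attention output (the "anchor direction").

    values:       (seq_len, d) value states of the cached tokens for one head
    attn_weights: (seq_len,)   attention weights of the current query over the cache
    """
    # Pre-eviction output: attention-weighted sum of value vectors.
    output = attn_weights @ values                           # (d,)
    # Unit vector along the pre-eviction output: the anchor direction.
    anchor = output / output.norm()                          # (d,)
    # Importance = each token's weighted contribution projected onto the anchor.
    # (Whether the score uses weighted or raw values is an assumption here.)
    return (attn_weights.unsqueeze(-1) * values) @ anchor    # (seq_len,)

def evict(values: torch.Tensor, attn_weights: torch.Tensor, budget: int) -> torch.Tensor:
    """Return indices of the `budget` highest-scoring tokens to retain."""
    scores = andpro_scores(values, attn_weights)
    keep = torch.topk(scores, k=budget).indices
    return keep.sort().values  # preserve positional order of the kept tokens
```

Intuitively, tokens whose contributions point along the pre-eviction output are the ones that shape it most, so dropping low-projection tokens perturbs the output less than dropping tokens by attention weight alone.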
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 27280