TAKE: Task-Aware Chunked KV Cache Eviction for Efficient Long-Context LLM Prefill

Submitted to ICLR 2026 · 15 Sept 2025 (modified: 11 Feb 2026) · CC BY 4.0
Keywords: Large Language Model, Efficient Generative Inference, Key-Value Cache
TL;DR: A training-free framework for prefill-stage KV cache optimization that achieves extremely low GPU memory usage.
Abstract: The rapid development of large language models (LLMs) has enhanced many language generation applications, but long-context inference still poses a serious memory challenge. Existing global pruning methods reduce memory during the decoding process while ignoring the prefill-stage memory peak, which delays the time-to-first-token. In this paper, we present Task-Aware Chunked KV Cache Eviction (TAKE), a training-free framework that optimizes KV cache memory during the prefill stage of LLM inference. TAKE partitions long sequences into chunks and incrementally performs task-aware KV fusion and eviction, thereby avoiding full-sequence processing and reducing memory and compute overhead. To preserve task-relevant information, we introduce lightweight task-aware probe tokens that identify salient tokens within each chunk and accumulate semantic information across chunks. Furthermore, we propose a delayed eviction strategy that protects shallow transformer layers from early pruning, mitigating representation degradation and improving performance stability. Extensive experiments show that TAKE achieves superior performance, reduces the peak GPU memory usage of the KV cache and activations to about 8.9\% of the baseline model, and lowers first-token latency by over 60\% for sequences up to 128k tokens. It also enables stable inference with arbitrarily long contexts on 24GB consumer GPUs without quantization or KV offloading, while maintaining model quality. Our code is available at https://anonymous.4open.science/r/TAKE-6B21.
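The chunked, probe-scored eviction loop described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function and parameter names (`chunked_evict`, `probe`, `protect_layers`, `keep_per_chunk`) are hypothetical, the probe here is a single vector rather than the paper's learned-free probe tokens, and real KV caches are per-head tensors inside a transformer.

```python
import numpy as np

def chunked_evict(keys, values, probe, layer,
                  chunk_size=4, keep_per_chunk=2, protect_layers=2):
    """Illustrative sketch of task-aware chunked KV eviction.

    keys, values : (seq_len, dim) arrays standing in for one head's KV cache
    probe        : (dim,) vector standing in for a task-aware probe token
    layer        : index of the current transformer layer
    """
    # Delayed eviction: shallow layers keep their full cache to avoid
    # degrading early representations (hypothetical threshold).
    if layer < protect_layers:
        return keys, values

    kept_k, kept_v = [], []
    # Process the sequence chunk by chunk instead of all at once,
    # so peak memory scales with the chunk, not the full sequence.
    for start in range(0, len(keys), chunk_size):
        k = keys[start:start + chunk_size]
        v = values[start:start + chunk_size]
        # Score each token by attention-like similarity to the probe.
        scores = k @ probe
        # Keep the top-scoring tokens of this chunk, in original order.
        top = np.sort(np.argsort(scores)[::-1][:keep_per_chunk])
        kept_k.append(k[top])
        kept_v.append(v[top])
    return np.concatenate(kept_k), np.concatenate(kept_v)
```

Under these assumptions, an 8-token cache with `chunk_size=4` and `keep_per_chunk=2` is reduced to 4 tokens on deep layers, while layers below `protect_layers` are returned untouched.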
Supplementary Material: zip
Primary Area: generative models
Submission Number: 5880