Keywords: Large Language Models, KV Cache Eviction, Efficient Long Reasoning
Abstract: Large Language Models (LLMs) exhibit enhanced capabilities through Chain-of-Thought reasoning. However, the extended reasoning sequences introduce significant GPU memory overhead due to the growing key-value (KV) cache. Existing KV cache compression methods mitigate memory bottlenecks but struggle in long reasoning tasks. In this paper, we analyze attention patterns in reasoning tasks and reveal a **Token Importance Recurrence** phenomenon: a large proportion of tokens regain high attention after multiple decoding steps, which existing methods fail to capture and which may lead to the unpredictable eviction of such periodically critical tokens. To address this, we propose **LazyEviction**, an observation-window-based lagged eviction framework that retains latent recurring tokens by prioritizing eviction according to tokens' recurrence patterns. Extensive experiments demonstrate that LazyEviction reduces KV cache size by 50%–70% while maintaining comparable accuracy, outperforming existing KV cache baselines.
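Below is a minimal sketch of the lagged-eviction idea the abstract describes, assuming a per-token recurrence counter, a hypothetical attention threshold `tau`, and illustrative names (`LazyEvictionCache`, `update`, `evict_indices`). The paper's exact scoring and eviction rules may differ; this only illustrates how an observation window can delay eviction while recurrence statistics accumulate.

```python
import torch

class LazyEvictionCache:
    """Illustrative sketch of observation-window-based lagged KV eviction.

    All names, thresholds, and the scoring rule here are assumptions for
    exposition; they are not the paper's exact formulation.
    """

    def __init__(self, budget: int, window: int = 32):
        self.budget = budget      # max KV entries kept outside the observation window
        self.window = window      # most recent tokens are never evicted (lagged eviction)
        self.recurrence = None    # per-token count of "regained high attention" events
        self.was_low = None       # whether a token's attention previously dipped low

    def update(self, attn: torch.Tensor, tau: float = 0.01):
        """attn: [seq_len] attention mass received at the current decoding step."""
        n = attn.shape[-1]
        # Grow the per-token statistics as the sequence extends.
        if self.recurrence is None or self.recurrence.shape[-1] < n:
            pad = n - (0 if self.recurrence is None else self.recurrence.shape[-1])
            zeros = torch.zeros(pad)
            self.recurrence = zeros if self.recurrence is None else torch.cat([self.recurrence, zeros])
            flags = torch.zeros(pad, dtype=torch.bool)
            self.was_low = flags if self.was_low is None else torch.cat([self.was_low, flags])
        high = attn > tau
        # A "recurrence" event: a token regains high attention after being low.
        self.recurrence += (high & self.was_low).float()
        self.was_low = ~high

    def evict_indices(self, seq_len: int) -> torch.Tensor:
        """Return indices to keep: the window plus the highest-priority older tokens."""
        old = seq_len - self.window
        if old <= self.budget:
            return torch.arange(seq_len)
        # Keep older tokens with the strongest recurrence patterns; evict the rest.
        keep_old = torch.topk(self.recurrence[:old], self.budget).indices.sort().values
        return torch.cat([keep_old, torch.arange(old, seq_len)])
```

In this sketch, eviction decisions are deferred until tokens leave the observation window, so a token whose importance recurs within the window is never dropped on the basis of a transient low-attention step.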
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: pruning, LLM Efficiency, NLP in resource-constrained settings
Contribution Types: Approaches to low-resource settings, Approaches for low-compute settings (efficiency)
Languages Studied: English
Submission Number: 9673