A2SF: Accumulative Attention Score with Forgetting Factor

20 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Model, KV Cache Compression, Natural Language Processing, Inference, Machine Learning, RL
TL;DR: A2SF uses accumulative attention with forgetting to prune tokens, lowering LLM memory costs and preserving accuracy across diverse tasks.
Abstract: Transformer-based large language models (LLMs) face memory bottlenecks during inference due to key-value (KV) caches. Recent research has therefore focused on identifying and discarding redundant tokens. Existing approaches often rely on attention scores to remove low-contributing tokens; however, because they adopt a fixed observation window size, they cannot guarantee input-level optimality and frequently suffer performance degradation across different environments. To overcome these limitations, we propose A2SF (Accumulative Attention Score with Forgetting Factor), a token selection method that applies a forgetting factor to accumulative attention scores. A2SF defines an accumulative formula that gradually forgets past attention contributions over time, thereby generalizing fixed-window approaches. Furthermore, it employs reinforcement learning to dynamically explore an appropriate forgetting factor for each sentence. Experimental results demonstrate that by dynamically adapting to the characteristics of each input, A2SF selects critical tokens more effectively, achieving 4–9% performance improvements over competitive baselines under a highly constrained token budget.
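The core idea of the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`a2sf_scores`, `select_tokens`), the toy attention matrix, and the specific forgetting-factor value are all assumptions for illustration; in the paper the factor is chosen per input via reinforcement learning rather than fixed.

```python
import numpy as np

def a2sf_scores(attn, forget=0.9):
    """Accumulate per-token attention with exponential forgetting.

    attn: (T, T) causal attention matrix; attn[t] is the attention
        row produced at decoding step t.
    forget: forgetting factor in (0, 1]. forget=1.0 recovers plain
        accumulative attention; smaller values weight recent steps
        more heavily, generalizing a fixed observation window.
    """
    scores = np.zeros(attn.shape[1])
    for row in attn:
        scores *= forget   # decay past contributions
        scores += row      # add this step's attention row
    return scores

def select_tokens(attn, budget, forget=0.9):
    """Keep the `budget` token indices with the highest scores."""
    scores = a2sf_scores(attn, forget)
    return np.sort(np.argsort(scores)[-budget:])
```

With a strong forgetting factor, tokens that only received attention early on decay away, while tokens attended to recently retain high scores and survive pruning under the cache budget.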
Primary Area: optimization
Submission Number: 23020