Keywords: Efficient Reasoning, LLM, Test-time compute, Math, Coding
Abstract: Reasoning-optimized language models increasingly rely on test-time compute (TTC)—long chains of thought before final answers—to boost accuracy, but this raises cost because causal self-attention scales quadratically in time and linearly in memory with sequence length. We observe that many intermediate thoughts are redundant: the model rarely needs to attend to all past tokens to generate effective next tokens and reach correct solutions. We propose RollingWindowReasoner, a simple yet effective inference-time technique that maintains only the first window (preserving critical problem context) and the last window (recent reasoning steps) of the key-value cache. Experiments across two model families and three reasoning domains—math reasoning, code generation, and academic QA—demonstrate that RollingWindowReasoner achieves similar accuracy with only 50\% of the KV-cache budget, corresponding to 2$\times$ memory savings and 4$\times$ compute reduction.
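The abstract describes keeping only the first window (problem context) and the last window (recent reasoning) of the KV cache. A minimal sketch of that eviction policy, with hypothetical class and parameter names not taken from the paper, might look like:

```python
from collections import deque

class RollingWindowKVCache:
    """Sketch of a first+last window KV cache (names/sizes are assumptions).

    Keeps the first `first_window` entries pinned (problem context) and a
    rolling buffer of the most recent `last_window` entries; everything in
    between is evicted.
    """

    def __init__(self, first_window: int, last_window: int):
        self.first_window = first_window
        self.first: list = []                            # pinned prefix entries
        self.recent: deque = deque(maxlen=last_window)   # rolling suffix

    def append(self, kv_entry) -> None:
        # Fill the pinned prefix first, then roll the suffix window.
        if len(self.first) < self.first_window:
            self.first.append(kv_entry)
        else:
            self.recent.append(kv_entry)  # deque silently drops the oldest

    def visible(self) -> list:
        # Entries the model would attend to at the next decoding step.
        return self.first + list(self.recent)


cache = RollingWindowKVCache(first_window=2, last_window=3)
for tok in range(8):          # integers 0..7 stand in for KV entries
    cache.append(tok)
print(cache.visible())        # -> [0, 1, 5, 6, 7]
```

Because the visible set is capped at `first_window + last_window` entries regardless of generation length, attention cost per decoded token stays constant instead of growing with the chain of thought.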
Submission Number: 123