Keywords: Efficient Reasoning, LLM, Test-time compute, Math, Coding
Abstract: Reasoning-optimized language models increasingly rely on test-time compute (TTC)—long chains of thought before final answers—to boost accuracy, but this raises cost because causal self-attention scales quadratically in time and linearly in memory with sequence length. We observe that many intermediate thoughts are redundant: the model rarely needs to attend to all past tokens to generate effective next tokens and reach correct solutions. We propose RollingWindowReasoner, a simple yet effective inference-time technique that maintains only the first window (preserving critical problem context) and the last window (recent reasoning steps) of the key-value cache. Experiments across two model families and three reasoning domains—math reasoning, code generation, and academic QA—demonstrate that RollingWindowReasoner achieves similar accuracy with only 50\% of the KV-cache budget, corresponding to 2$\times$ memory savings and 4$\times$ compute reduction.
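The abstract describes keeping only the first window (problem context) and the last window (recent reasoning) of the KV cache. A minimal sketch of that eviction policy, with hypothetical class and parameter names not taken from the paper, might look like:

```python
from collections import deque

class RollingWindowKVCache:
    """Sketch of a first+last window KV cache (names/sizes are assumptions).

    Keeps the first `first_window` entries pinned (problem context) and a
    rolling buffer of the most recent `last_window` entries; everything in
    between is evicted.
    """

    def __init__(self, first_window: int, last_window: int):
        self.first_window = first_window
        self.first: list = []                            # pinned prefix entries
        self.recent: deque = deque(maxlen=last_window)   # rolling suffix

    def append(self, kv_entry) -> None:
        # Fill the pinned prefix first, then roll the suffix window.
        if len(self.first) < self.first_window:
            self.first.append(kv_entry)
        else:
            self.recent.append(kv_entry)  # deque silently drops the oldest

    def visible(self) -> list:
        # Entries the model would attend to at the next decoding step.
        return self.first + list(self.recent)


cache = RollingWindowKVCache(first_window=2, last_window=3)
for tok in range(8):          # integers 0..7 stand in for KV entries
    cache.append(tok)
print(cache.visible())        # -> [0, 1, 5, 6, 7]
```

Because the visible set is capped at `first_window + last_window` entries regardless of generation length, attention cost per decoded token stays constant instead of growing with the chain of thought.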
Submission Number: 123