Reasoning Cache: Learning to Extrapolate to Long Lengths via Short-Length RL

Published: 05 Mar 2026 · Last Modified: 05 Mar 2026 · ICLR 2026 Workshop RSI Poster · License: CC BY 4.0
Keywords: LLM, reasoning, reinforcement learning, test time compute, extrapolation
TL;DR: We develop a method to train reasoning models to recursively improve their reasoning over long horizons.
Abstract: Large Language Models (LLMs) that continue improving at test-time budgets far beyond their training budgets can solve harder problems by leveraging additional inference compute; we refer to this property as extrapolation. Standard on-policy RL operates on fixed problem distributions and training budgets, giving rise to a train-test distribution shift that limits the model's extrapolation capabilities. To address this, we introduce Reasoning Cache (RC), an iterative decoding algorithm that replaces standard autoregressive decoding and enables models to extrapolate to lengths an order of magnitude longer than those seen during training. RC exploits the asymmetry between the summarization and generation capabilities of LLMs to construct a decoding process that improves consistently over iterations. Its effectiveness can be further increased through training, which amplifies the model's ability to perform summary-conditioned reasoning while avoiding the challenges of long-horizon RL. Training a 4B instruction-following model with RC using a 16k-token training budget improves performance on HMMT 2025 from 40% to 70% when evaluated with a 512k-token test budget, substantially surpassing comparably sized LLMs.
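The abstract describes RC's core loop only in prose; the Python sketch below illustrates one plausible instantiation of iterative, summary-conditioned decoding. It is a minimal sketch, assuming a generic `llm.generate(prompt, max_tokens=...)` interface; the function name `rc_decode`, the prompts, and the budgets are hypothetical illustrations, not the paper's actual algorithm.

```python
# Hypothetical sketch of an iterative summarize-then-continue decoding loop
# in the spirit of Reasoning Cache (RC). The `llm.generate` interface,
# prompts, and budget values are assumptions, not taken from the paper.

def rc_decode(llm, problem: str, iterations: int = 32,
              chunk_budget: int = 16_000, cache_budget: int = 1_000) -> str:
    """Reason under a short per-chunk budget, carrying progress forward
    as a compact summary (the 'reasoning cache')."""
    cache = ""   # summary of reasoning so far; empty on the first iteration
    answer = ""
    for _ in range(iterations):
        # Condition a short reasoning chunk on the problem plus the cache.
        prompt = (
            f"Problem: {problem}\n"
            f"Summary of previous reasoning: {cache or '(none)'}\n"
            "Continue reasoning toward a solution:"
        )
        chunk = llm.generate(prompt, max_tokens=chunk_budget)
        # Compress the old cache plus the new chunk into an updated summary,
        # exploiting that summarizing text is easier for LLMs than generating it.
        cache = llm.generate(
            f"Summarize the key progress and partial results:\n{cache}\n{chunk}",
            max_tokens=cache_budget,
        )
        answer = chunk  # the latest chunk holds the current best attempt
    return answer
```

Keeping the cache short bounds the context seen at every iteration, which is what would let a model trained at a 16k-token budget be run usefully at a 512k-token test budget: total generated length grows with the number of iterations, not with the per-call context.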
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 46