e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs

Published: 10 Jun 2025 (Last Modified: 10 Jun 2025) · LCFM 2025 · CC BY 4.0
Keywords: LLM, reasoning, test-time compute, RL, exploration
TL;DR: We analyze three key ingredients to teach LLMs to explore in-context and improve performance when we extrapolate test-time compute beyond what the LLMs are trained for.
Abstract: Test-time scaling offers a promising path to improve LLM reasoning; however, the true promise of this paradigm lies in extrapolation (i.e., scaling performance as LLMs "think" for longer). We show that one way to enable extrapolation is by training the LLM for in-context exploration; that is, training the LLM to spend its test-time budget effectively by chaining operations such as generation, verification, and refinement. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining asymmetries in base LLM competence, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging negative gradients from incorrect traces to amplify exploration that chains additional asymmetries; and (3) aligning task difficulty with the training token budget to structure in-context exploration. Our recipe e3 produces the best-performing 1.7B model on AIME/HMMT'25 and can extrapolate compute to 2.5x the model's training budget.
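The first ingredient, chaining verification (easy) with generation (hard) into an in-context search, can be illustrated with a minimal sketch. The function names (`generate`, `verify`, `refine`) and the toy guess-and-check task are purely illustrative assumptions, not the paper's actual implementation or API:

```python
import random

def in_context_explore(generate, verify, refine, budget):
    """Illustrative sketch of chaining asymmetries: generation (hard)
    is interleaved with verification (easy), spending a fixed budget
    of chained operations. Not the paper's actual training recipe."""
    trace = []
    candidate = generate()
    for _ in range(budget):
        trace.append(candidate)
        if verify(candidate):          # verification: the easy operation
            return candidate, trace
        candidate = refine(candidate)  # refinement: chain another attempt
    return candidate, trace

# Hypothetical toy task: find the integer square root of 144
# by guess-and-check, standing in for an LLM's search over answers.
target = 144
rng = random.Random(0)
result, trace = in_context_explore(
    generate=lambda: rng.randint(1, 20),
    verify=lambda x: x * x == target,
    refine=lambda x: x + 1 if x * x < target else x - 1,
    budget=32,
)
```

Because the budget caps how many generate-verify-refine steps can be chained, a larger budget directly buys more in-context search, which is the lever that extrapolating test-time compute pulls.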
Submission Number: 32