e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs

Published: 12 Jun 2025 · Last Modified: 25 Jun 2025 · EXAIT@ICML 2025 Poster · License: CC BY 4.0
Track: Language Modeling
Keywords: LLM, reasoning, test-time compute, RL, exploration
TL;DR: We analyze three key ingredients for teaching LLMs to explore in-context, improving performance when test-time compute is extrapolated beyond the budget the LLMs were trained on.
Abstract: Test-time scaling offers a promising path to improving LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation, i.e., performance continuing to improve on hard problems as LLMs keep "thinking" for longer, well beyond the maximum token budget they were trained on. Surprisingly, we find that most existing reasoning models do not extrapolate. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to spend its test-time budget effectively by chaining operations (such as generation, verification, and refinement) and testing multiple hypotheses before committing to an answer. To enable in-context exploration, we identify three key ingredients in our recipe e3: (1) chaining asymmetries in base LLM competence, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging negative gradients from incorrect traces to amplify exploration that chains additional asymmetries, resulting in longer search traces during RL; and (3) aligning task difficulty with the training token budget to structure in-context exploration. Our recipe e3 produces the best known 1.7B model by AIME/HMMT'25 scores, which can also extrapolate test-time compute to 2.5x its training budget.
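To make ingredient (1) concrete, below is a minimal sketch of in-context search via chained generation and verification. All names here (generate, verify, refine, tokens_used) are hypothetical stand-ins for LLM calls, not the paper's code; in e3 this chaining behavior is trained into the model via RL rather than scripted externally.

```python
from typing import Callable, Optional


def in_context_explore(
    generate: Callable[[str], str],       # proposes a candidate solution (hard)
    verify: Callable[[str, str], bool],   # checks a candidate (easy)
    refine: Callable[[str, str], str],    # revises a rejected candidate
    tokens_used: Callable[[str], int],    # counts tokens spent on a trace
    problem: str,
    token_budget: int,
) -> Optional[str]:
    """Chain the easy operation (verification) with the hard one
    (generation) until the budget runs out: a sketch of the asymmetry
    chaining that e3 aims to elicit, under the assumptions above."""
    candidate = generate(problem)
    spent = tokens_used(candidate)
    while spent < token_budget:
        if verify(problem, candidate):
            return candidate              # commit only once verification passes
        candidate = refine(problem, candidate)
        spent += tokens_used(candidate)   # each hypothesis consumes budget
    return None                           # budget exhausted without a verified answer
```

Because the loop keeps testing hypotheses until verification succeeds or the budget is spent, a model that has internalized this pattern can, in principle, keep improving when given a larger budget at test time than it saw in training, which is the extrapolation behavior the abstract describes.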
Serve As Reviewer: ~Matthew_Y._R._Yang1, ~Amrith_Setlur1
Submission Number: 30