Learn to Explore In-Context via Reinforcement Learning

Futing Wang; Jianhao Yan; Yun Luo; Ganqu Cui; Zhi Wang; Xiaoye Qu; Yue Zhang; Yu Cheng; Tao Lin

Published: 17 Jun 2026, Last Modified: 23 Jun 2026
Venue: ICML 2026 AI4Math Workshop Poster
License: CC BY 4.0

Keywords: Test Time Scaling; LLM

Abstract: Effective test-time scaling requires models to perform in-context exploration: the ability to generate, evaluate, and refine multiple reasoning hypotheses within a single trajectory. How to quantify and incentivize such exploration during reinforcement learning remains unclear. In this work, we propose a principled view of in-context exploration through state coverage, which measures the diversity of abstract reasoning states visited during generation. Directly optimizing state coverage is intractable, but we show that trajectory length provides a simple and effective proxy for expanding exploration capacity. Naively encouraging longer reasoning, however, leads to degenerate behaviors such as repetition. To address this, we propose Length-Incentivized Non-redundant Exploration, a reward shaping approach that jointly incentivizes longer trajectories and penalizes redundant patterns. Experiments across multiple models and benchmarks show that the method consistently improves reasoning performance and produces more diverse reasoning trajectories, resulting in stronger test-time scaling behavior. On Qwen3-4B-Base, it improves average mathematical reasoning accuracy by 4.4 points over strong RL baselines, while also improving out-of-domain generalization.
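
The abstract does not give the reward formula, so the following is only a minimal sketch of what a reward that "jointly incentivizes longer trajectories and penalizes redundant patterns" could look like. Everything in it is an assumption made for illustration and not the authors' implementation: the function names `repeated_ngram_fraction` and `shaped_reward`, the use of repeated n-grams as the redundancy signal, the linear length bonus capped at `max_len`, and the weights `alpha` and `beta`.

```python
from collections import Counter


def repeated_ngram_fraction(tokens, n=4):
    """Fraction of n-grams that duplicate an earlier n-gram in the trajectory.

    Hypothetical redundancy measure; the paper may define redundancy differently.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def shaped_reward(correct, tokens, max_len=4096, alpha=0.2, beta=0.5, n=4):
    """Task reward plus a capped length bonus minus a redundancy penalty.

    alpha, beta, and max_len are illustrative values, not taken from the paper.
    """
    task = 1.0 if correct else 0.0
    # Length bonus grows linearly with trajectory length and saturates at max_len.
    length_bonus = alpha * min(len(tokens), max_len) / max_len
    # Penalty grows with the share of repeated n-grams.
    redundancy_penalty = beta * repeated_ngram_fraction(tokens, n)
    return task + length_bonus - redundancy_penalty
```

Under these assumptions, two correct trajectories of equal length receive different rewards when one of them loops over the same tokens: the looping trajectory earns the same length bonus but loses part of it to the redundancy penalty, which is the degenerate case the abstract says a pure length incentive would otherwise reward.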

Supplementary Material: zip

Submission Number: 73
