Keywords: Test Time Scaling; LLM
Abstract: Achieving effective test-time scaling requires models to perform \emph{in-context exploration} --- the ability to generate, evaluate, and refine multiple reasoning hypotheses within a single trajectory.
However, how to quantify and incentivize such exploration during reinforcement learning remains unclear.
In this work, we propose a principled view of in-context exploration through \emph{state coverage}, measuring the diversity of abstract reasoning states visited during generation.
While directly optimizing state coverage is intractable, we show that trajectory length provides a simple and effective proxy for expanding exploration capacity.
Naively encouraging longer reasoning, however, leads to degenerate behaviors such as repetition.
To address this, we propose Length-Incentivized Non-redundant Exploration (\method), a reward-shaping approach that jointly rewards longer trajectories and penalizes redundant patterns.
Experiments across multiple models and benchmarks show that \method consistently improves reasoning performance and leads to more diverse reasoning trajectories, resulting in stronger test-time scaling behavior.
On Qwen3-4B-Base, \method improves average mathematical reasoning accuracy by 4.4 points over strong RL baselines, while also improving out-of-domain generalization.
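To make the reward-shaping idea concrete, the following is a minimal illustrative sketch, not the formulation used in this work. It assumes a length bonus that grows linearly with trajectory length up to a cap, and a redundancy penalty based on the fraction of repeated token n-grams. The function names, weights, and the n-gram redundancy measure are all hypothetical choices made for illustration.

```python
from collections import Counter


def repeated_ngram_fraction(tokens, n=4):
    """Fraction of n-grams that duplicate another n-gram in the sequence.

    Hypothetical redundancy measure: 0.0 means every n-gram is unique.
    """
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    counts = Counter(tuple(tokens[i:i + n]) for i in range(total))
    return 1.0 - len(counts) / total


def shaped_reward(tokens, task_reward, max_len=8192,
                  length_weight=0.1, redundancy_weight=0.5, n=4):
    """Task reward plus a capped length bonus minus a redundancy penalty.

    All weights and the cap are assumed values, not taken from the paper.
    """
    length_bonus = length_weight * min(len(tokens) / max_len, 1.0)
    penalty = redundancy_weight * repeated_ngram_fraction(tokens, n)
    return task_reward + length_bonus - penalty
```

Under these assumptions, a long trajectory that simply loops over the same tokens earns the length bonus but loses more to the penalty, so it scores below a non-repetitive trajectory of equal length.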
Supplementary Material: zip
Submission Number: 73