Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward

18 Sept 2025 (modified: 04 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Reasoning LLM, RLVR, Low Probability Token, Exploration
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has propelled progress in complex reasoning for Large Language Models, yet its scalability is often hindered by a training bottleneck in which performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration remain underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key phenomenon: the gradual elimination of what we term \textbf{\textit{reasoning sparks}}, a crucial subset of low-probability tokens, such as ``wait'', that initiate diverse reasoning paths. We find that, while abundant in pre-trained models, these sparks are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by first applying a probability threshold to filter out noise tokens and then re-normalizing the distribution over the remaining candidates. This process effectively shields exploratory tokens from destructive updates. Experiments show that Lp-Reg enables stable on-policy training for around 3,000 steps over 81,204 GPU-hours, a regime in which baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a $60.17\%$ average accuracy on five math benchmarks, an improvement of $2.66\%$ over prior methods.
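
The abstract describes the Lp-Reg proxy only at a high level (threshold-filter the policy's token distribution, renormalize the survivors, and regularize the policy toward that proxy). The following is a minimal PyTorch sketch of one plausible reading of that mechanism; the function names (`build_proxy`, `lp_reg_penalty`), the threshold value, and the KL direction are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def build_proxy(probs: torch.Tensor, threshold: float = 1e-3) -> torch.Tensor:
    """Zero out tokens below the probability threshold, then renormalize
    over the remaining candidates (assumed reading of the proxy construction)."""
    kept = torch.where(probs >= threshold, probs, torch.zeros_like(probs))
    return kept / kept.sum(dim=-1, keepdim=True).clamp_min(1e-12)

def lp_reg_penalty(logits: torch.Tensor, threshold: float = 1e-3) -> torch.Tensor:
    """KL(proxy || policy): penalizes updates that crush the low-probability
    tokens that survive the threshold. The KL direction is an assumption."""
    log_p = F.log_softmax(logits, dim=-1)
    proxy = build_proxy(log_p.exp().detach(), threshold)  # proxy treated as a fixed target
    kl = (proxy * (proxy.clamp_min(1e-12).log() - log_p)).sum(dim=-1)
    return kl.mean()

# Example: compute the penalty on dummy logits; in practice it would be added
# to the policy-gradient loss with a small coefficient.
logits = torch.randn(4, 32000, requires_grad=True)  # (batch, vocab)
loss = lp_reg_penalty(logits, threshold=1e-3)
loss.backward()
```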
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12394