Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward

ACL ARR 2026 January Submission 3318 Authors

04 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Reasoning LLM, RLVR, Low Probability Token, Exploration
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. While previous methods attempt to maintain high entropy, we argue that unselective entropy maximization risks amplifying irrelevant noise rather than fostering meaningful exploration. In this paper, we identify a deeper issue: the gradual elimination of valuable low-probability exploratory tokens, which we term reasoning sparks, driven by RLVR over-penalization. To address this, we introduce Low-probability Regularization (Lp-Reg). Leveraging the statistical distinction where reasoning sparks exhibit higher probabilities than noise, Lp-Reg constructs a filtered, re-normalized proxy distribution. By penalizing deviations from this proxy via forward KL divergence, our method selectively shields these valuable tokens from elimination. Experiments demonstrate that Lp-Reg enables stable on-policy training for over $3,000$ steps (81,204 GPU-hours), sustaining exploration in regimes where baselines typically collapse. Validated across extensive evaluations totaling over 300,000 cumulative GPU-hours, Lp-Reg consistently achieves state-of-the-art performance across diverse model families, sizes, and domains, with relative accuracy improvements ranging from $3.06\%$ to $7.98\%$.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Language Modeling, Machine Learning for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 3318