Keywords: large language models, reasoning models, reinforcement learning, RLVR, exploration, unlearning
TL;DR: We propose EEPO, which enhances exploration in RLVR by temporarily suppressing sampled trajectories during rollouts, yielding average gains of 10-33% on five mathematical reasoning benchmarks across three model backbones.
Abstract: Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize exploitation, leading to entropy collapse, diminished exploratory capacity, and ultimately limited performance gains. Techniques that inject randomness can increase policy stochasticity, but they frequently fail to escape dominant behavioral modes; the resulting sample-and-reward dynamics further amplify these modes and erode exploration. We introduce Exploration-Enhanced Policy Optimization (EEPO), a novel framework that promotes exploration through two-stage rollouts with adaptive unlearning. In the first stage, the model generates half of the rollout trajectories; it then undergoes a lightweight, temporary unlearning step that suppresses these sampled responses, forcing the second stage to explore different regions of the output space. This sample-then-forget mechanism actively steers the policy away from dominant modes and encourages mode-seeking exploration. Across five reasoning benchmarks, EEPO consistently outperforms baselines, achieving average gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.
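The sketch below illustrates the sample-then-forget rollout structure described in the abstract, using a toy single-step categorical policy in PyTorch. The policy, optimizer, learning rate, number of unlearning steps, and the simple log-likelihood suppression objective are illustrative assumptions; the paper's adaptive unlearning procedure and LLM-scale trajectories are not reproduced here.

```python
# Minimal sketch of EEPO's two-stage sample-then-forget rollout (toy setting).
import torch
import torch.nn.functional as F

vocab_size, group_size = 16, 8                                  # toy action space, rollouts per prompt
policy_logits = torch.zeros(vocab_size, requires_grad=True)     # stand-in for an LLM policy
optimizer = torch.optim.SGD([policy_logits], lr=0.5)            # hypothetical unlearning optimizer

def sample(n):
    """Draw n single-step 'trajectories' from the current policy."""
    probs = F.softmax(policy_logits, dim=-1)
    return torch.multinomial(probs, n, replacement=True)

# Stage 1: sample half of the rollout group with the current policy.
first_half = sample(group_size // 2)

# Temporary unlearning: snapshot the weights, then take a few lightweight steps
# that lower the log-likelihood of the stage-1 samples (suppress them).
snapshot = policy_logits.detach().clone()
for _ in range(3):                                              # step count is an assumption
    log_probs = F.log_softmax(policy_logits, dim=-1)
    unlearn_loss = log_probs[first_half].mean()                 # minimizing this pushes probability away
    optimizer.zero_grad()
    unlearn_loss.backward()
    optimizer.step()

# Stage 2: the perturbed policy now samples away from the dominant mode.
second_half = sample(group_size // 2)

# Restore the original weights; the unlearning step is temporary and only shapes exploration.
with torch.no_grad():
    policy_logits.copy_(snapshot)

# The combined group is then scored by the verifier and used for the usual RLVR
# policy-gradient update on the restored policy.
rollouts = torch.cat([first_half, second_half])
print("rollout group:", rollouts.tolist())
```

In the full method the suppression acts on the LLM's parameters and, because the unlearning is temporary, the policy update on the combined rollout group is applied to the restored model.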
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23319