Keywords: reinforcement learning, policy optimization, momentum
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of Large Language Models (LLMs), yet samples drawn from the policy model are not fully exploited during training. We propose Momentum-Aware Policy Optimization (MAPO), a critic-free, drop-in framework that preserves the simplicity of GRPO while improving exploration and stability. MAPO introduces (i) a Momentum Group Baseline that yields non-vanishing learning signals under group-standardized rewards; (ii) confidence-based prioritized replay that reuses verified successes to increase sample efficiency; and (iii) entropy-weighted token updates that concentrate gradient mass on uncertain decision points. Evaluated on math reasoning benchmarks, MAPO outperforms strong baselines, including GRPO and DAPO, in best-of-$N$ accuracy (pass@$N$), demonstrating superior exploration and discovery of correct reasoning trajectories. Ablation studies attribute the primary gains to the momentum advantage, which reduces the number of optimization steps needed to reach target accuracy, alleviates stalls on homogeneous reward groups, and lowers across-seed variance. The replay and entropy components provide complementary improvements in sample utilization and gradient allocation. Overall, MAPO reaches target performance in fewer optimization steps while maintaining training stability, offering a practical enhancement to group-based RLVR methods.
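To illustrate the stall the abstract refers to, the minimal sketch below contrasts GRPO's group-standardized advantage, which collapses to zero when all rewards in a group are identical, with a momentum-style baseline that retains a learning signal. The abstract does not specify the Momentum Group Baseline's update rule, so this sketch assumes an exponential moving average of past group-mean rewards; the names `MomentumGroupBaseline` and `beta` are illustrative, not the paper's definitions.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style group baseline: standardize rewards within a rollout group.
    If every reward in the group is equal (a homogeneous group), the
    advantage is zero everywhere and the group contributes no gradient."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

class MomentumGroupBaseline:
    """Assumed momentum baseline: an exponential moving average of past
    group-mean rewards per prompt, subtracted from current rewards so that
    even a homogeneous group can yield a non-zero advantage."""
    def __init__(self, beta=0.9):
        self.beta = beta
        self.baseline = {}  # prompt_id -> EMA of group-mean reward

    def advantages(self, prompt_id, rewards):
        r = np.asarray(rewards, dtype=np.float64)
        b = self.baseline.get(prompt_id, r.mean())
        adv = r - b  # signal survives even when r.std() == 0
        # Update the per-prompt EMA with the current group mean.
        self.baseline[prompt_id] = self.beta * b + (1.0 - self.beta) * r.mean()
        return adv

# Example: a prompt whose rollouts become uniformly correct after earlier failures.
mgb = MomentumGroupBaseline(beta=0.9)
mgb.advantages("q1", [0.0, 0.0, 1.0, 0.0])          # mixed group seeds the baseline
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))        # GRPO: all zeros, no gradient
print(mgb.advantages("q1", [1.0, 1.0, 1.0, 1.0]))   # momentum: positive advantages
```

Under this assumed EMA form, the homogeneous all-correct group still receives a positive advantage because the baseline lags behind the improved group mean, which is one way a momentum baseline can avoid the zero-signal stall of pure group standardization.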
Primary Area: reinforcement learning
Submission Number: 16633