Keywords: reinforcement learning, policy optimization, momentum
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of Large Language Models (LLMs), yet samples drawn from the policy model are not fully exploited during training. We propose Momentum-Aware Policy Optimization (MAPO), a critic-free, drop-in framework that preserves the simplicity of GRPO while improving exploration and stability. MAPO introduces (i) a Momentum Group Baseline that yields non-vanishing learning signals under group-standardized rewards; (ii) confidence-based prioritized replay that reuses verified successes to increase sample efficiency; and (iii) entropy-weighted token updates that concentrate gradient mass on uncertain decision points. Evaluated on math reasoning benchmarks, MAPO outperforms strong baselines, including GRPO and DAPO, in best-of-$N$ accuracy (pass@$N$), demonstrating superior exploration and discovery of correct reasoning trajectories. Ablation studies attribute the primary gains to the momentum advantage, which reduces the number of optimization steps needed to reach target accuracy, alleviates stalls on homogeneous reward groups, and lowers across-seed variance. The replay and entropy components provide complementary improvements in sample utilization and gradient allocation. Overall, MAPO reaches target performance in fewer optimization steps while maintaining training stability, offering a practical enhancement to group-based RLVR methods.
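To illustrate the stall the abstract refers to, the minimal sketch below contrasts GRPO's group-standardized advantage, which collapses to zero when all rewards in a group are identical, with a momentum-style baseline that retains a learning signal. The abstract does not specify the Momentum Group Baseline's update rule, so this sketch assumes an exponential moving average of past group-mean rewards; the names `MomentumGroupBaseline` and `beta` are illustrative, not the paper's definitions.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style group baseline: standardize rewards within a rollout group.
    If every reward in the group is equal (a homogeneous group), the
    advantage is zero everywhere and the group contributes no gradient."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

class MomentumGroupBaseline:
    """Assumed momentum baseline: an exponential moving average of past
    group-mean rewards per prompt, subtracted from current rewards so that
    even a homogeneous group can yield a non-zero advantage."""
    def __init__(self, beta=0.9):
        self.beta = beta
        self.baseline = {}  # prompt_id -> EMA of group-mean reward

    def advantages(self, prompt_id, rewards):
        r = np.asarray(rewards, dtype=np.float64)
        b = self.baseline.get(prompt_id, r.mean())
        adv = r - b  # signal survives even when r.std() == 0
        # Update the per-prompt EMA with the current group mean.
        self.baseline[prompt_id] = self.beta * b + (1.0 - self.beta) * r.mean()
        return adv

# Example: a prompt whose rollouts become uniformly correct after earlier failures.
mgb = MomentumGroupBaseline(beta=0.9)
mgb.advantages("q1", [0.0, 0.0, 1.0, 0.0])          # mixed group seeds the baseline
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))        # GRPO: all zeros, no gradient
print(mgb.advantages("q1", [1.0, 1.0, 1.0, 1.0]))   # momentum: positive advantages
```

Under this assumed EMA form, the homogeneous all-correct group still receives a positive advantage because the baseline lags behind the improved group mean, which is one way a momentum baseline can avoid the zero-signal stall of pure group standardization.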
Primary Area: reinforcement learning
Submission Number: 16633