Keywords: Group Relative Policy Optimization, Advantage Reversion, Advantage Mirror
TL;DR: We propose a mixed advantage strategy for GRPO that addresses the advantage reversion and advantage mirror problems.
Abstract: Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as the central mechanism in GRPO for ranking trajectory importance. However, existing approaches suffer from both advantage reversion and advantage mirror problems, which hinder reasonable advantage allocation across different query samples. In this work, we propose a simple yet effective GRPO strategy, **M**ixed **A**dvantage **P**olicy **O**ptimization (**MAPO**). We observe that trajectories exhibit different levels of certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparisons with state-of-the-art methods, along with ablation studies on different advantage variants, validate the effectiveness of our approach.
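To make the idea concrete, below is a minimal sketch of how a mixed advantage could be computed on top of the standard group-normalized GRPO advantage. The `percent_deviation_advantage` formula and the certainty-based mixing weight are illustrative assumptions based on the abstract, not the paper's exact definitions; all function and variable names are hypothetical.

```python
import numpy as np

def grpo_advantage(rewards):
    """Standard GRPO advantage: z-score of each rollout's reward within its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def percent_deviation_advantage(rewards):
    """Assumed form of an 'advantage percent deviation': deviation relative to the
    group mean rather than the group standard deviation (sketch only)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (abs(r.mean()) + 1e-8)

def mixed_advantage(rewards, correct_mask):
    """Sketch of a mixed advantage: blend the two variants by a trajectory-certainty
    proxy (group agreement on correctness). Weighting scheme is illustrative."""
    r = np.asarray(rewards, dtype=float)
    p = float(np.mean(correct_mask))        # fraction of correct rollouts in the group
    certainty = abs(2.0 * p - 1.0)          # ~1 when rollouts agree, ~0 when mixed
    a_std = grpo_advantage(r)
    a_pct = percent_deviation_advantage(r)
    return certainty * a_pct + (1.0 - certainty) * a_std

# Example: a group of 4 rollouts with binary correctness rewards.
print(mixed_advantage([1.0, 1.0, 1.0, 0.0], [True, True, True, False]))
```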
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1171