Keywords: Large Language Models, Reinforcement Learning, Math Reasoning
Abstract: Group Relative Policy Optimization (GRPO) uses the group's average reward as a baseline, eliminating the need for a value model and substantially improving the reasoning ability of large language models (LLMs). However, vanilla GRPO assigns uniform weight to all rollout samples when computing the relative advantage, ignoring the probability with which each sample was generated. This can make the estimated relative advantage inaccurate, especially when the number of rollouts is small, and leads to suboptimal performance. To address this limitation, we propose Probability Weighted Policy Optimization (PWPO), which explicitly incorporates each sample's generation probability into the calculation of the relative advantage. This probability-aware training mechanism dynamically adjusts each sample's weight according to its generation probability. Experimental results on five mathematical reasoning benchmarks demonstrate the superiority of our method.
Submission Number: 10
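Illustrative sketch (not taken from the paper): the abstract contrasts a uniform group-average baseline with a probability-aware one, and the Python below shows one way that contrast could look. The abstract does not give the PWPO formula, so `prob_weighted_advantages`, its use of normalized sequence probabilities as weights, and the argument name `seq_logprobs` are all assumptions made for illustration only. The standard-deviation normalization often applied in GRPO is omitted for brevity.

```python
import math

def grpo_advantages(rewards):
    """Vanilla group-relative advantage: reward minus the plain group mean."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

def prob_weighted_advantages(rewards, seq_logprobs):
    """Hypothetical probability-aware variant (an assumption, not the paper's formula).

    Each rollout's weight is its generation probability, normalized over the
    group, and the baseline is the weighted mean reward.
    """
    # Subtract the max log-probability before exponentiating for stability.
    m = max(seq_logprobs)
    weights = [math.exp(lp - m) for lp in seq_logprobs]
    total = sum(weights)
    weights = [w / total for w in weights]
    baseline = sum(w * r for w, r in zip(weights, rewards))
    return [r - baseline for r in rewards]
```

With equal log-probabilities the two functions coincide; they differ only when some rollouts are more likely than others under the policy, which is the situation the abstract argues a uniform baseline handles poorly.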