Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning in GRPO

TMLR Paper 6651 Authors

25 Nov 2025 (modified: 30 Nov 2025) · Under review for TMLR · CC BY 4.0
Abstract: Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO)~\citep{Shao-2024-Deepseekmath}, has shown strong empirical results in training DeepSeek-R1~\citep{Guo-2025-Deepseek}. However, GRPO fails to update the policy when all responses within a group are incorrect (i.e., \emph{all-negative-sample} groups). This limitation underscores a key gap between artificial and human intelligence: unlike humans, who can learn from mistakes, GRPO discards the signal carried by these failed responses. Our first contribution is a simple framework that mitigates the all-negative-sample issue by incorporating response diversity within groups using a \textit{step-wise} judge model, which can be either directly trained or adapted from existing LLMs. We prove that this diversification can accelerate GRPO's learning dynamics in a simplified setting. We also empirically validate the proposed stepwise guided policy optimization (SGPO) method, demonstrating consistent gains across model sizes (7B, 14B, 32B), including base and distilled variants, in both offline and online training on 9 benchmarks. Our results highlight two advantages: (i) SGPO surpasses GRPO, especially in the early and mid-training stages where all-negative-sample groups are prevalent; and (ii) SGPO does not require judge models to generate correct answers, differentiating it from knowledge distillation methods.
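To illustrate the all-negative-sample issue the abstract describes, here is a minimal sketch (not taken from the paper) of GRPO-style group-relative advantage normalization: the function name `grpo_advantages` and the example judge scores are illustrative assumptions, not the authors' implementation. When every response in a group receives the same (zero) reward, the normalized advantages are all zero and no policy gradient flows; any step-wise scoring that differentiates the failed responses restores a nonzero signal.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages in the GRPO style:
    normalize each response's reward by the group mean and std."""
    rewards = np.asarray(rewards, dtype=float)
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

# All-negative group: every sampled response is incorrect (reward 0).
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))
# -> [0. 0. 0. 0.]  (identical advantages, so no learning signal)

# Hypothetical step-wise judge scores grading partial reasoning quality
# could differentiate otherwise all-negative responses:
print(grpo_advantages([0.1, 0.4, 0.0, 0.2]))
# -> nonzero advantages, so the group now contributes a gradient
```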
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Marc_Lanctot1
Submission Number: 6651