Keywords: Large Language Models, Reinforcement Learning, Group Relative Policy Optimization
Abstract: Recent breakthroughs in Large Language Model (LLM) reasoning have been driven by reinforcement learning techniques such as PPO and GRPO. However, in Reinforcement Learning with Verifiable Rewards (RLVR), sparse rewards hinder learning when all responses in a sampled group receive identical scores, causing the group-relative advantage to vanish. Existing methods attempt to address this with data filtering, but they inadvertently limit further progress on prompts that are already answered correctly. In addition, reward models that output absolute numerical scores often suffer from unstable score ranges, undermining training stability. To address these issues, we introduce intra-group response preference ranking as the reward signal. We propose the Ranking Reward Model (RRM), a listwise preference model designed for GRPO that outputs a relative preference ranking over multiple responses to the same prompt. Building on RRM, RankGRPO incorporates three strategies for leveraging these rankings, mitigating both vanishing gradients and the instability caused by absolute scoring. Experiments show that RankGRPO improves performance across RLVR benchmarks, open-ended tasks, and reward model evaluations. RRM, trained with limited data, outperforms traditional numerical reward models trained on larger datasets, demonstrating the potential of RankGRPO and the effectiveness of ranking-based reward signals. Our source code is available at https://anonymous.4open.science/r/RankGRPO-0542.
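The abstract describes replacing absolute reward scores with an intra-group preference ranking as the signal for GRPO-style advantage estimation. As a rough illustration only, and not the paper's actual implementation, the sketch below converts a listwise ranking of a sampled group into zero-mean, unit-variance advantages; the function and variable names are assumptions introduced here for clarity.

```python
import numpy as np

def rank_based_advantages(rank_positions):
    """Turn a listwise ranking of G responses (1 = most preferred)
    into group-normalized advantages, GRPO-style.

    Hypothetical helper for illustration; not the RRM/RankGRPO code.
    """
    ranks = np.asarray(rank_positions, dtype=np.float64)
    # Map ranks to scores so the most-preferred response gets the largest value.
    scores = -(ranks - 1.0)          # ranks [1, 2, 3, 4] -> scores [0, -1, -2, -3]
    std = scores.std()
    if std < 1e-8:                   # degenerate case: all responses tied
        return np.zeros_like(scores)
    return (scores - scores.mean()) / std

# Example: a group of 4 sampled responses ranked by a listwise reward model.
print(rank_based_advantages([2, 1, 4, 3]))   # distinct, non-vanishing advantages
```

Because the ranking always orders the group's responses, the normalized scores are non-degenerate even when a verifiable reward would assign every response the same value, which is the failure mode the abstract attributes to sparse RLVR rewards.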
Primary Area: reinforcement learning
Submission Number: 973