Abstract: Reinforcement learning (RL) methods such as Group Relative Policy Optimization (GRPO) have recently emerged as a leading approach for enhancing the reasoning ability of large language models (LLMs). Yet, the precise sources of their effectiveness remain unclear. In this work, we systematically decompose GRPO by benchmarking it against simpler REINFORCE-style baselines to identify its core components. Our analysis reveals a clear hierarchy: (i) iterative, online data collection is the dominant driver of performance, enabling even simple positive-only fine-tuning (e.g., RAFT) to be surprisingly strong; (ii) negative signals primarily sustain exploration by preventing rapid entropy collapse; and (iii) GRPO’s main benefit stems not from reward normalization itself, but from the implicit data filtering effect it induces by discarding prompts with uniform rewards (all-correct or all-incorrect). Guided by this insight, we propose REINFORCE-Rej, a minimal variant that makes filtering explicit. REINFORCE-Rej matches GRPO’s performance while being simpler and more KL-efficient. These findings suggest that principled data filtering, rather than algorithmic complexity, is the key to robust RL for LLMs.
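To make the filtering idea concrete, below is a minimal, hypothetical sketch (not taken from the paper) of the prompt-level rejection step the abstract describes: prompts whose sampled responses all receive the same reward (all-correct or all-incorrect) are discarded before a REINFORCE-style update. The function and variable names (`filter_uniform_reward_groups`, `rewards`, etc.) and the use of the raw reward as the policy-gradient weight are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List


def filter_uniform_reward_groups(groups: List[Dict]) -> List[Dict]:
    """Keep only prompts whose sampled responses do not all share the same reward."""
    kept = []
    for g in groups:
        rewards = g["rewards"]          # one scalar reward per sampled response
        if len(set(rewards)) > 1:       # mixed outcomes -> informative gradient signal
            kept.append(g)
        # all-correct or all-incorrect groups are dropped (rejected) entirely
    return kept


def reinforce_weights(rewards: List[float]) -> List[float]:
    """Plain REINFORCE weighting: the raw reward scales each response's log-prob gradient.
    (Assumption for illustration; the paper argues normalization itself is not the key.)"""
    return list(rewards)


# Example: two prompts, each with four sampled responses scored 0/1 by a verifier.
batch = [
    {"prompt": "p1", "rewards": [1.0, 0.0, 1.0, 0.0]},  # mixed -> kept
    {"prompt": "p2", "rewards": [0.0, 0.0, 0.0, 0.0]},  # all-incorrect -> rejected
]
for g in filter_uniform_reward_groups(batch):
    print(g["prompt"], reinforce_weights(g["rewards"]))
```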
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Zhongwen_Xu1
Submission Number: 6490