Keywords: Reasoning in language models, Reinforcement Learning
Abstract: Reinforcement learning (RL) has become a key driver of language model reasoning. Among RL algorithms, Group Relative Policy Optimization (GRPO) is the de facto standard, avoiding the need for a critic by using per-prompt baselines and variance normalization. Yet the role of normalization remains unclear. In this work, we provide an explanation through the lens of the local curvature of the sequence-level policy gradient. We show that standard deviation normalization implements an adaptive gradient, improving convergence when curvature varies across prompts and across iterations. Furthermore, empirical studies on synthetic tasks and GSM8K confirm that normalization consistently improves stability and convergence, especially on harder problems with high reward variance. By establishing the connection between normalization and adaptive gradients, we provide a theoretical foundation for the empirical success of GRPO and offer broader insights into the design of critic-free RL algorithms for LLM training.
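The group-relative normalization the abstract refers to can be sketched as follows. This is a minimal, illustrative Python version of GRPO-style advantage computation (per-prompt baseline plus standard deviation normalization), not the authors' exact implementation; the function name and `eps` stabilizer are assumptions for the sketch.

```python
# Minimal sketch of GRPO-style group advantage normalization
# (illustrative only; not the paper's exact implementation).

def grpo_advantages(rewards, eps=1e-8):
    """Normalize a group of per-prompt rewards into advantages.

    rewards: list of scalar rewards for G responses sampled
             from the same prompt.
    """
    g = len(rewards)
    mean = sum(rewards) / g                      # per-prompt baseline
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    # Dividing by the group std rescales the gradient per prompt,
    # which is the adaptive-gradient effect the paper analyzes.
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group with identical rewards yields zero advantages (no learning signal), while a mixed group is rescaled to roughly unit spread regardless of the raw reward scale.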
Submission Number: 290