Keywords: Reasoning in language models, Reinforcement Learning
Abstract: Reinforcement learning (RL) has become a key driver of language model reasoning. Among RL algorithms, Group Relative Policy Optimization (GRPO) is the de facto standard, avoiding the need for a critic by using per-prompt baselines and variance normalization. Yet the role of normalization remains unclear. In this work, we provide an explanation through the lens of the local curvature of the sequence-level policy gradient. We show that standard deviation normalization implements an adaptive gradient, improving convergence when curvature varies across prompts and across iterations. Furthermore, empirical studies on synthetic tasks and GSM8K confirm that normalization consistently improves stability and convergence, especially on harder problems with high reward variance. By establishing the connection between normalization and adaptive gradients, we provide a theoretical foundation for the empirical success of GRPO and offer broader insights into the design of critic-free RL algorithms for LLM training.
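The group-relative normalization the abstract refers to can be sketched as follows. This is a minimal, illustrative Python version of GRPO-style advantage computation (per-prompt baseline plus standard deviation normalization), not the authors' exact implementation; the function name and `eps` stabilizer are assumptions for the sketch.

```python
# Minimal sketch of GRPO-style group advantage normalization
# (illustrative only; not the paper's exact implementation).

def grpo_advantages(rewards, eps=1e-8):
    """Normalize a group of per-prompt rewards into advantages.

    rewards: list of scalar rewards for G responses sampled
             from the same prompt.
    """
    g = len(rewards)
    mean = sum(rewards) / g                      # per-prompt baseline
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    # Dividing by the group std rescales the gradient per prompt,
    # which is the adaptive-gradient effect the paper analyzes.
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group with identical rewards yields zero advantages (no learning signal), while a mixed group is rescaled to roughly unit spread regardless of the raw reward scale.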
Submission Number: 290