Keywords: Reasoning in language models, Reinforcement Learning
Abstract: Group Relative Policy Optimization (GRPO), a critic-free, per-prompt REINFORCE-style method with within-prompt standardization, reliably stabilizes RL for LLM reasoning, but the mechanism behind this is unclear. We show that within-prompt reward variance estimates the local curvature of the sequence-level policy gradient, so standard-deviation normalization implements a prompt-wise adaptive step size. Under a mild orthogonality assumption we prove faster convergence than unnormalized REINFORCE, and we validate the effect on synthetic tasks and GSM8K.
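A minimal sketch of the within-prompt standardization the abstract refers to: for each prompt, a group of completions is sampled, and each completion's reward is centered and divided by the group's standard deviation before weighting the REINFORCE gradient. The function name, group size, and `eps` constant are illustrative assumptions, not from the paper.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Within-prompt standardized advantages (GRPO-style sketch).

    rewards: scalar rewards for the G sampled completions of ONE prompt.
    Returns A_i = (r_i - mean) / (std + eps). Dividing by the within-prompt
    std is what the abstract interprets as a prompt-wise adaptive step size.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, G=4 sampled answers with binary correctness rewards:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Each completion's log-likelihood gradient would then be weighted by A_i;
# the advantages are zero-mean within the group.
```

Because the normalization is computed per prompt, easy prompts (low reward variance) and hard prompts (high variance) contribute gradients of comparable scale, which is the adaptive-step-size effect the abstract analyzes.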
Submission Number: 229