Keywords: Reasoning in language models, Reinforcement Learning
Abstract: Group Relative Policy Optimization (GRPO), a critic-free, per-prompt REINFORCE-style method with within-prompt standardization, reliably stabilizes RL for LLM reasoning, but the mechanism behind this is unclear. We show that within-prompt reward variance estimates the local curvature of the sequence-level policy gradient, so standard-deviation normalization implements a prompt-wise adaptive step size. Under a mild orthogonality assumption we prove faster convergence than unnormalized REINFORCE, and we validate the effect on synthetic tasks and GSM8K.
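A minimal sketch of the within-prompt standardization the abstract refers to: for each prompt, a group of completions is sampled, and each completion's reward is centered and divided by the group's standard deviation before weighting the REINFORCE gradient. The function name, group size, and `eps` constant are illustrative assumptions, not from the paper.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Within-prompt standardized advantages (GRPO-style sketch).

    rewards: scalar rewards for the G sampled completions of ONE prompt.
    Returns A_i = (r_i - mean) / (std + eps). Dividing by the within-prompt
    std is what the abstract interprets as a prompt-wise adaptive step size.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, G=4 sampled answers with binary correctness rewards:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Each completion's log-likelihood gradient would then be weighted by A_i;
# the advantages are zero-mean within the group.
```

Because the normalization is computed per prompt, easy prompts (low reward variance) and hard prompts (high variance) contribute gradients of comparable scale, which is the adaptive-step-size effect the abstract analyzes.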
Submission Number: 229