Keywords: LLM, Reinforcement Learning
TL;DR: We provide the first theoretical analysis of the generalization and optimization of Group Relative Policy Optimization.
Abstract: Group Relative Policy Optimization (GRPO)~\citep{shao2024deepseekmath,guo2025deepseek} has rapidly become a critic-free default for aligning LLMs, yet its statistical and computational foundations remain unclear. We close this gap with the first unified theory of GRPO that addresses both generalization and optimization, for the original, practitioner-used formulation, across multiple outer iterations. On the generalization side, we derive sequential (multi-iteration) PAC-Bayes–Bernstein bounds under Markov mixing that concentrate the \emph{empirical GRPO surrogate} around its population counterpart across all iterations; a Transformer path-norm corollary yields substantially tighter capacity terms than spectral norms. We further prove a TRPO-style return bridge showing that ascent on the population GRPO surrogate improves the true return, with explicit, controllable bias from clipping and KL regularization. On the optimization side, we establish non-PL \emph{stationarity} guarantees for SGDM and AdamW (both $\tilde O(1/\sqrt{K})$) and provide complementary PL-based rates, with variance controlled by $t_{\mathrm{mix}}/(G\sqrt{K})$. Together with interactive information-theoretic lower bounds, our results deliver the first end-to-end, multi-iteration statistical and computational guarantees for GRPO with function approximation. Experiments corroborate the predicted trends and offer practical guidance on group size, clipping, and KL weight; code will be released.
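For reference, a minimal sketch of the per-question empirical GRPO surrogate the abstract refers to, following the formulation in the cited works \citep{shao2024deepseekmath,guo2025deepseek}; the group size $G$, clipping radius $\varepsilon$, KL weight $\beta$, reference policy $\pi_{\mathrm{ref}}$, and group-normalized advantage $\widehat{A}_i$ are taken from those works, and the exact surrogate analyzed in the paper may differ in minor details:
$$
\widehat{J}_{\mathrm{GRPO}}(\theta)
= \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\min\!\Big( r_{i,t}(\theta)\,\widehat{A}_{i},\;
\mathrm{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\widehat{A}_{i} \Big)
\;-\; \beta\,\mathrm{D}_{\mathrm{KL}}\!\big(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\big),
$$
where $r_{i,t}(\theta) = \pi_{\theta}(o_{i,t}\mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})$ is the token-level importance ratio over a group of $G$ sampled responses $o_1,\dots,o_G$ to a question $q$, and $\widehat{A}_{i} = \big(R_i - \mathrm{mean}(R_{1:G})\big) / \mathrm{std}(R_{1:G})$ is the critic-free, group-relative advantage computed from the outcome rewards $R_{1:G}$.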
Primary Area: reinforcement learning
Submission Number: 12896