Keywords: LLMs, Reinforcement Learning
TL;DR: Your GRPO Is Secretly DPO
Abstract: Group Relative Policy Optimization (GRPO) is a prominent reinforcement learning algorithm for post-training Large Language Models (LLMs).
It is commonly believed that GRPO necessitates a large group size to ensure stable training via precise statistical estimation, which incurs substantial computational overhead.
In this work, we challenge this assumption by reframing GRPO as a form of contrastive learning,
which reveals a fundamental connection to Direct Preference Optimization (DPO).
Motivated by DPO's empirical success, we investigate the minimal two-rollout case (2-GRPO)—a configuration previously deemed infeasible.
We provide a rigorous theoretical analysis to validate 2-GRPO and demonstrate empirically that it achieves performance on par with 16-GRPO,
despite using only $1/8$ of the rollouts and reducing training time by over $70\%$.
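
To make the contrastive reading of the abstract concrete, below is a minimal sketch (not code from the paper) of the standard group-relative advantage used in GRPO, assuming the usual normalization by the group's mean and standard deviation; the function name and example rewards are illustrative only. With a group of two rollouts, the normalized advantages collapse to roughly $\pm 1$, i.e. a pairwise winner/loser contrast reminiscent of DPO's chosen/rejected pairs.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage: normalize each rollout's reward by its group's statistics."""
    mean = rewards.mean()
    std = rewards.std()  # population std over the group
    return (rewards - mean) / (std + eps)

# With a larger group, advantages form a graded ranking of the rollouts.
print(group_relative_advantages(np.array([0.0, 1.0, 0.3, 0.9])))

# With a group of two, the advantages are approximately +1 / -1:
# the update pushes up the better rollout and down the worse one,
# a pairwise contrast akin to DPO.
print(group_relative_advantages(np.array([0.2, 0.8])))  # ~[-1., 1.]
```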
Submission Number: 236