Keywords: LLMs, Reinforcement Learning
TL;DR: Your GRPO Is Secretly DPO
Abstract: Group Relative Policy Optimization (GRPO) is a prominent reinforcement learning algorithm for post-training Large Language Models (LLMs).
It is commonly believed that GRPO necessitates a large group size to ensure stable training via precise statistical estimation, which incurs substantial computational overhead.
In this work, we challenge this assumption by reframing GRPO as a form of contrastive learning,
which reveals a fundamental connection to Direct Preference Optimization (DPO).
Motivated by DPO's empirical success, we investigate the minimal two-rollout case (2-GRPO)—a configuration previously deemed infeasible.
We provide a rigorous theoretical analysis to validate 2-GRPO and demonstrate empirically that it achieves performance on par with 16-GRPO,
despite using only $1/8$ of the rollouts and reducing training time by over $70\%$.
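
To make the contrastive reading of the abstract concrete, below is a minimal sketch (not code from the paper) of the standard group-relative advantage used in GRPO, assuming the usual normalization by the group's mean and standard deviation; the function name and example rewards are illustrative only. With a group of two rollouts, the normalized advantages collapse to roughly $\pm 1$, i.e. a pairwise winner/loser contrast reminiscent of DPO's chosen/rejected pairs.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage: normalize each rollout's reward by its group's statistics."""
    mean = rewards.mean()
    std = rewards.std()  # population std over the group
    return (rewards - mean) / (std + eps)

# With a larger group, advantages form a graded ranking of the rollouts.
print(group_relative_advantages(np.array([0.0, 1.0, 0.3, 0.9])))

# With a group of two, the advantages are approximately +1 / -1:
# the update pushes up the better rollout and down the worse one,
# a pairwise contrast akin to DPO.
print(group_relative_advantages(np.array([0.2, 0.8])))  # ~[-1., 1.]
```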
Submission Number: 236