{
  "title": "GRPO vs. DPO: Fundamental Differences",
  "turns": [
    {
      "turn": 1,
      "user": "What's the fundamental reason GRPO is better than DPO?",
      "assistant": "The fundamental reason GRPO (Group Relative Policy Optimization) shows better performance than DPO (Direct Preference Optimization) lies in **how it leverages the information in the training data**.\n\nIn short, while DPO relies on the \\*\\*1:1 pair (Pairwise comparison)\\*\\* of 'preferred/unpreferred', GRPO simultaneously compares 'one preferred response' with \\*\\*'a group of multiple unpreferred responses (Groupwise comparison)'\\*\\*, learning a richer and more stable signal.\n\nHere are the key differences between the two approaches and the specific advantages of GRPO:\n\n### 1\\. How DPO (Direct Preference Optimization) works and its limitations\n\n*   **How it works**: DPO uses a dataset of the form `(prompt, chosen_response, rejected_response)`. That is, for a given prompt, it trains on a pair consisting of the response humans preferred more (`chosen`) and the one they preferred less (`rejected`). The model's objective is directly optimized to increase the generation probability of the `chosen` response and decrease that of the `rejected` response. This is an efficient way to update the policy directly via preference pairs without explicitly training a reward model.\n*   **Fundamental limitation**: **Loss of information in the data (Information Loss)**\n    *   If you have one `chosen` response and multiple `rejected` responses (e.g., rejected\\_1, rejected\\_2, rejected\\_3), DPO must split them into multiple independent 1:1 pairs like `(chosen, rejected_1)`, `(chosen, rejected_2)`, `(chosen, rejected_3)` for training.\n    *   Because each pair is handled independently, information about relations among `rejected` responses or the overall preference distribution is not effectively used during training. This can make the training signal unstable or biased.\n\n### 2\\. GRPO (Group Relative Policy Optimization): solution and advantages\n\n*   **How it works**: To overcome DPO's limitations, GRPO uses `(prompt, chosen_response, {rejected_response_1, rejected_response_2, ...})`—that is, it handles one preferred response and a **group of multiple unpreferred responses** at once.\n*   **Fundamental superiority**: **Strengthened training signals via groupwise comparison (Groupwise Comparison & Signal Enhancement)**\n    1.  **Data efficiency and increased information (Higher Data Efficiency)**: GRPO extracts more information from the same data. It learns not only that the `chosen` response is better than a single `rejected` response, but that it is, on average, better than the entire `rejected` group. This yields a much stronger and more comprehensive training signal.\n    2.  **Stable gradients (More Stable Gradients)**: In 1:1 comparisons, training can be strongly perturbed by the noise or idiosyncrasies of a particular `rejected` sample. In contrast, GRPO computes the `advantage` by considering the average characteristics of multiple `rejected` samples. By comparing each response's reward to the group's average reward to assess relative superiority, the noise of individual samples is canceled out, enabling updates that are much more stable and consistent.\n    3.  **Robustness to diverse negative examples (Robustness to Diverse Negative Samples)**: Unpreferred responses can be bad for different reasons. DPO learns each reason individually, whereas GRPO generates effective gradients that push the model away from the entire 'unpreferred region' that includes diverse types of bad responses.\n\n### Summary\n\n| Feature  | **DPO (Direct Preference Optimization)** | **GRPO (Group Relative Policy Optimization)** |\n| --- | --- | --- |\n| **Comparison unit** | **Pairwise (Pairwise)**: `(Chosen, Rejected)` | **Groupwise (Groupwise)**: `(Chosen, {Rejected_Group})` |\n| **Use of information** | Limited. Learns only that `Chosen` is better than a single `Rejected`. | Comprehensive. Learns that `Chosen` is better than the entire `Rejected` group. |\n| **Training signal** | Can be unstable, sensitive to noise in individual samples. | Averages over the group to cancel noise, yielding a stable and strong signal. |\n| **Data efficiency** | Relatively low. | High. Extracts richer information from the same amount of data. |\n\nIn conclusion, GRPO extends DPO's `Pairwise` learning paradigm to `Groupwise`, making a fundamental improvement that leverages the information inherent in preference data far more efficiently and robustly. As a result, it delivers more stable and superior performance, especially when there is a complex and diverse set of candidate responses."
    },
    {
      "turn": 2,
      "user": "So, apart from how the reward is given, are there no implementation differences?",
      "assistant": "No. Beyond just how the reward is given, there are fundamental differences in the **core algorithmic structure and optimization objective (Loss Function)**.\n\nRather than being an extension of DPO, GRPO is closer to a **variant of PPO (Proximal Policy Optimization) without a value function (Value Function)**. The main implementation differences are as follows.\n\n| Category  | **DPO (Direct Preference Optimization)** | **GRPO (Group Relative Policy Optimization)** |\n| --- | --- | --- |\n| **Core optimization objective (Loss Function)** | Uses a **log-likelihood-based loss function**. It directly optimizes to increase the log-probability of the preferred response (`chosen`) and decrease the log-probability of the less-preferred response (`rejected`). | Uses a **PPO-like policy-gradient-based loss function**. It computes the relative `Advantage` within the group and optimizes by modulating the policy ratio (new policy / old policy) using that advantage. |\n| **Need for a reward model (Reward Model)** | **Not required.** The preference/non-preference pair data itself serves as an implicit reward signal. | **Generally required.** A separate reward model is used to score each response (reward) in the group. |\n| **Value function (Value/Critic Function)** | **Not used.** | **Not used.** Instead of PPO's complex value function (critic), it simplifies the structure by using \\*\\*the group's mean reward (mean reward)\\*\\* as a baseline (baseline) to compute the `Advantage`. |\n| **Data handling** | Primarily uses static `(chosen, rejected)` pair data. | During training, it **dynamically samples response groups** from the current policy (policy). |\n\n* * *\n\n### Key takeaways\n\n*   **DPO** treats preference data like a kind of **classification (Classification) problem**, combining reward modeling and reinforcement learning into a single **'direct optimization'** approach.\n*   **GRPO** keeps the framework of \\*\\*'reinforcement learning (RL)'\\*\\* while replacing PPO's key component, the value function (critic), with an efficient \\*\\*'group-relative evaluation'\\*\\*.\n\nTherefore, beyond how they define the reward signal, the **optimization mechanisms themselves are fundamentally different** in how they use that signal to update the model's policy (policy)."
    }
  ]
}