Keywords: Reward Modeling, RLHF, Direct Preference Optimization
Abstract: Reward models play a crucial role in the post-training of large language models (LLMs). While explicit reward models are widely used, implicit approaches like Direct Preference Optimization (DPO; Rafailov et al., 2023) offer an alternative. However, implicit models often exhibit weaker generalization and performance [citation needed]. In this work, we investigate why DPO underperforms compared to explicit reward models. We first demonstrate that generating high-quality answers is generally more difficult than discriminating between good and bad answers, providing an intuitive explanation for DPO's weaker generalization, since DPO directly learns to generate rather than to discriminate. Further, we show that the DPO objective requires greater model capacity to fit effectively, suggesting that the learning task itself is more challenging. Crucially, because DPO operates by directly optimizing token-level probabilities, the combination of large vocabulary sizes and long-tail token distributions leads to inefficient learning dynamics that ultimately degrade model performance. To address this, we propose a simple modification to DPO's formulation: removing the log-softmax, which improves the implicit reward model's effectiveness. This adjustment can also enhance algorithms such as PRIME that rely on implicit reward modeling.
Submission Number: 1
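To make the proposed modification concrete, below is a minimal sketch (not the authors' implementation) contrasting the standard DPO sequence score, which sums log-softmax probabilities of the target tokens, with a variant that sums the raw logits at those positions. The function names, the `beta` default, and the interpretation of "removing the log-softmax" as scoring with unnormalized logits are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def sequence_score(logits, labels, mask, use_log_softmax=True):
    """Score a response under a model.

    logits: (batch, seq_len, vocab) model outputs
    labels: (batch, seq_len) target token ids
    mask:   (batch, seq_len) 1 for response tokens, 0 for prompt/padding

    With use_log_softmax=True this is the standard DPO per-sequence
    log-probability; with False it sums the raw logits of the target
    tokens (an assumed reading of "removing the log-softmax").
    """
    if use_log_softmax:
        token_scores = torch.log_softmax(logits, dim=-1)
    else:
        token_scores = logits  # skip normalization over the vocabulary
    picked = token_scores.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (picked * mask).sum(dim=-1)

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO preference loss on implicit rewards r = beta * (policy - reference)."""
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

In this sketch, only the sequence-scoring step changes between the two variants; the pairwise preference loss itself is left untouched, which matches the abstract's framing of the change as a modification to how the implicit reward is computed rather than to the training objective.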