Keywords: LLM, DPO, RLHF
TL;DR: Understanding why reward-model-free alignment (DPO) degrades in LLMs: we identify the 3D-properties of DPO's learning process and propose regularizations that mitigate them.
Abstract: Aligning large language models (LLMs) with human preferences has recently garnered significant attention, with Proximal Policy Optimization (PPO) being a canonical yet computationally expensive method, and Direct Preference Optimization (DPO) offering a simpler and more efficient alternative. While prior studies have explored the trade-offs between PPO and DPO, DPO remains underutilized in state-of-the-art production-level LLMs, suggesting potential limitations. In this work, we revisit DPO with a comprehensive analysis of its theoretical foundations and empirical performance, aiming to chart a path forward and bridge this gap. We identify three critical properties—termed the \textbf{3D}-properties—that arise from DPO’s learning process: \textbf{D}rastic drop in the likelihood of rejected responses, \textbf{D}egradation into response suppression, and \textbf{D}ispersion effect on unseen responses. We show that these phenomena stem from the inherent features of DPO’s optimization objective, where the interaction between the gradients of chosen and rejected responses causes instability. These findings are supported by experiments on both a carefully constructed toy model and practical LLM tasks, including mathematical problem-solving and instruction following. Our work offers new insights, connecting these observations to related research while providing a theoretical explanation for the underlying mechanisms. To address the challenges posed by the \textbf{3D}-properties, we propose straightforward regularization techniques that enhance training stability and final performance. Additionally, we investigate how the distribution of paired preference data affects DPO’s efficacy, contributing to a broader understanding of how alignment models handle out-of-domain (OOD) data. We believe our findings will help guide future research toward closing the gap between reward-model-free preference learning and reward-model-based approaches.
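For readers unfamiliar with the objective the abstract refers to, the following is a minimal sketch, in standard notation assumed here rather than taken from the paper, of the DPO loss from Rafailov et al. (2023) and its gradient, whose single sigmoid weight couples the chosen- and rejected-response gradient terms mentioned above:
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
\]
\[
\nabla_\theta \mathcal{L}_{\mathrm{DPO}} =
  -\,\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \Big[ \underbrace{\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)}_{\text{shared weight}}
    \big( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big) \Big],
\qquad
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}.
\]
Because the same scalar weight multiplies both the chosen- and rejected-response terms, raising \(\pi_\theta(y_w \mid x)\) and suppressing \(\pi_\theta(y_l \mid x)\) are tightly coupled; this coupling is the "interaction between the gradients of chosen and rejected responses" to which the 3D-properties analysis attributes the instability.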
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5344