Keywords: direct preference optimization, online DPO, multi-armed bandit
TL;DR: We study the convergence rates of (online) DPO from an optimization perspective and show the impact of samplers through a theoretical separation and empirical experiments.
Abstract: Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment.
Despite its empirical success, DPO's *optimization* properties, particularly the impact of samplers on its convergence rates, remain underexplored. In this paper, we provide a rigorous analysis of DPO's *convergence rates* under different sampling strategies in the exact gradient setting, revealing a surprising separation: uniform sampling achieves *linear* convergence, while our proposed online sampler achieves *quadratic* convergence (the standard definitions of these rates are sketched below).
We further adapt the sampler to practical settings by incorporating posterior distributions and *logit mixing*, demonstrating significant improvements over previous approaches (an illustrative logit-mixing sketch follows the abstract).
On the Safe-RLHF dataset, our method achieves a $9.5\%$ improvement over vanilla DPO and on-policy DPO; on the Iterative-Prompt dataset, it outperforms vanilla DPO and hybrid GSHF by over $9.5\%$.
Our results not only shed light on the theoretical standing of DPO but also pave the way for future algorithm designs.
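For reference, here is a minimal LaTeX sketch of the standard optimization notions behind the claimed linear/quadratic separation. The symbol $\delta_t$ for the suboptimality at iteration $t$ is our own illustrative notation, not taken from the paper.

```latex
% Standard convergence-rate definitions (illustrative notation):
% linear convergence contracts the error by a constant factor per step,
% while quadratic convergence squares the error, which is far faster
% once \delta_t is small.
\[
  \text{linear:}\quad \delta_{t+1} \le \rho\,\delta_t,\ \ \rho \in (0,1),
  \qquad
  \text{quadratic:}\quad \delta_{t+1} \le C\,\delta_t^{2},\ \ C > 0.
\]
```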
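And a minimal, hedged Python sketch of what a logit-mixing sampler could look like: it draws tokens from a distribution whose logits are an affine combination of the current policy's and the reference model's logits. The function name, the coefficient `alpha`, and the affine form are illustrative assumptions, not the paper's exact construction.

```python
import torch
import torch.nn.functional as F

def sample_logit_mixing(policy_logits, ref_logits, alpha=2.0, temperature=1.0):
    """Sample next tokens from a logit-mixed distribution.

    Assumes 'logit mixing' combines the current policy's and the
    reference model's logits as
        mixed = alpha * policy_logits + (1 - alpha) * ref_logits,
    so alpha > 1 extrapolates away from the reference model
    (an illustrative choice; the paper's coefficients may differ).
    """
    mixed = alpha * policy_logits + (1.0 - alpha) * ref_logits
    probs = F.softmax(mixed / temperature, dim=-1)
    # One sampled token index per row of the batch.
    return torch.multinomial(probs, num_samples=1)

# Toy usage: batch of 4 contexts over a 10-token vocabulary.
policy_logits = torch.randn(4, 10)
ref_logits = torch.randn(4, 10)
tokens = sample_logit_mixing(policy_logits, ref_logits)
print(tokens.shape)  # torch.Size([4, 1])
```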
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1735