Keywords: alignment, preference learning, RLHF, win rate, language model
TL;DR: We provide a win-rate-centric framework to unify disparate methods in preference learning.
Abstract: The surging interest in learning from preference data has resulted in an elaborate landscape of methods and evaluations.
This work offers a framework to simplify this landscape, starting from the underlying sampling distribution for preference data.
First, we show that the only evaluation of a generative model that is grounded in the preference data sampling distribution is win rate.
Given that win rate is the only quantity from preference data alone that can matter, we relate common preference learning algorithms to direct win rate optimization (DWRO). We outline the theoretical benefits of RLHF as a variant of DWRO; explain why checkpointing is difficult with DPO, a non-DWRO objective; and characterize the limits of SFT on preferred samples in terms of the win rate improvement it can achieve.
Furthermore, we provide closed-form expressions for the expected win rate improvement under the above objectives, formalizing the role of a model's starting point in the achievable win rate improvement. Finally, we conduct an empirical analysis of existing methods and alternative DWRO objectives, which suggests that optimization improvements are likely key to advancing preference learning.
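For concreteness, here is a minimal sketch (not taken from the paper) of how win rate against a reference model could be estimated by Monte Carlo; the helpers `sample_model`, `sample_reference`, and the preference judge `prefers` are hypothetical placeholders standing in for the preference data sampling distribution.

```python
# Illustrative sketch only: Monte Carlo estimate of win rate against a reference model.
# `sample_model`, `sample_reference`, and `prefers` are assumed, hypothetical callables.
import random

def estimate_win_rate(prompts, sample_model, sample_reference, prefers, n=1000):
    """Fraction of comparisons in which the model's sample is preferred over the reference's."""
    wins = 0
    for _ in range(n):
        x = random.choice(prompts)       # draw a prompt
        y_model = sample_model(x)        # response from the model under evaluation
        y_ref = sample_reference(x)      # response from the reference model
        wins += prefers(y_model, y_ref)  # 1 if the model's response is preferred
    return wins / n
```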
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12482