TL;DR: The paper motivates and develops a win rate-centric framework to understand preference learning, with takeaways for current practice and future research.
Abstract: Preference learning, or the task of aligning generative models to preference comparison data, has yet to reach the conceptual maturity of classification, density estimation, etc. To close this gap, this work presents a framework to understand preference learning starting from the sampling distribution of pairwise preference data. First, we prove that the only evaluation of a generative model that respects both preferences and prevalences in the data distribution is a form of win rate, justifying win rate as the focal point to understand preference learning. We then analyze preference learning methods as win rate optimization (WRO) or non-WRO. We present novel instances of WRO beyond existing examples (RLHF, NLHF) and identify two key theoretical benefits of all such methods. We prove that common non-WRO methods like DPO and SFT on preferred samples lack these properties and suggest ways to mitigate such theoretical limitations. We also show that WRO underperforms in practice due to optimization difficulties and that optimization success predicts performance better than choices that affect the objective's solution. Our analysis highlights best practices for existing methods and provides recommendations for future research, guided by the principle that one should either align non-WRO methods more closely with WRO or improve the optimization of WRO objectives.
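For illustration only (the notation below is ours and not necessarily the paper's): if p_θ denotes the model being evaluated, q a competitor model, and p(y ≻ y′) the probability under the preference data distribution that output y is preferred to output y′, a win rate of the form the abstract refers to can be written as

    W(p_θ; q) = E_{y ∼ p_θ, y′ ∼ q} [ p(y ≻ y′) ],

and a win rate optimization (WRO) method, in this sketch, seeks max_θ W(p_θ; q), possibly with additional regularization.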
Lay Summary: When we train AI models, it’s important to make sure they do what people actually want. The primary paradigm for doing this is to ask humans to choose between several model-generated options and use that feedback to continue training the model. This is called preference learning. Compared to other tasks, like training models to classify images or generate realistic text, preference learning is still not well understood.
This research introduces a clearer way to think about preference learning. We show that win rate — how often a model generates content preferred over a competitor — is the only evaluation metric that respects the properties of the preference data being collected. This gives us a solid foundation to study preference learning.
We divide preference-learning methods into two groups: those that aim to directly improve win rate (which we call WRO methods) and those that don’t (non-WRO methods). We introduce new WRO methods, explain why they’re theoretically stronger, and show that common techniques like SFT (supervised fine-tuning) or DPO (direct preference optimization) miss out on these strengths—though we offer ideas to make them better.
Interestingly, even though WRO methods are better in theory, they often perform worse in practice because they are harder to optimize. Our findings suggest that improving how these methods are trained may matter more than choosing the perfect method on paper.
Overall, our work helps researchers better understand what works in preference learning and offers guidance on how to improve preference learning to get AI systems to better match what people actually want.
Primary Area: Social Aspects->Alignment
Keywords: preference learning, alignment, RLHF
Submission Number: 7100