Keywords: Preference Optimization, Direct Preference Optimization, DPO, Multi-Preference Optimization, MPO, Policy Optimization, Reinforcement Learning from Human Feedback, RLHF
TL;DR: We introduce Multi-Preference Optimization (MPO), a novel method that generalizes DPO to learn efficiently from sets of preferred and dispreferred responses, achieving state-of-the-art performance in LLM post-training.
Abstract: Modern post-training pipelines for LLMs frequently use on-policy generation to produce multiple candidate responses per prompt. However, popular alignment methods such as Direct Preference Optimization (DPO) are restricted to pairwise comparisons and therefore discard valuable supervisory signal. To address this, we propose Multi-Preference Optimization (MPO), a generalization of DPO that optimizes over entire sets of selected and rejected responses. This set-level contrastive approach is theoretically grounded. We first prove that leveraging $n$ responses achieves $\mathcal{O}\bigl(\tfrac{1}{\sqrt{n}}\bigr)$ convergence in total variation (TV) distance to the true preference distribution. We then prove, under a formal model with spacing-scaled Gaussian noise ($\Delta, \sigma = \mathcal{O}(1/n)$), that the reliability of MPO's 2-bin partition remains bounded away from zero, whereas the reliability of full-ranking methods degrades exponentially ($\exp(-\mathcal{O}(n))$). To further enhance learning, MPO employs deviation-based weighting, which emphasizes outlier responses and induces an implicit curriculum. Empirically, across multiple models and benchmarks, MPO achieves state-of-the-art performance, with an improvement of up to $\sim 17.5$\% in win rate (WR) on AlpacaEval2 in the on-policy iterative setting, and state-of-the-art results in off-policy settings.
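The set-level contrastive idea described in the abstract can be illustrated with a small sketch. The abstract does not state the loss itself, so everything below is an assumption made for illustration rather than the authors' formulation: the function names, the DPO-style implicit reward $\beta(\log \pi_\theta - \log \pi_{\mathrm{ref}})$, the 2-bin split at the mean score, and the use of absolute deviation from the mean as a multiplicative weight are all hypothetical choices.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO-style implicit reward (assumed here, not taken from the abstract)."""
    return beta * (logp_policy - logp_ref)


def deviation_weights(scores, eps=1e-8):
    """Hypothetical weighting: absolute deviation of each score from the mean."""
    mean = sum(scores) / len(scores)
    return [abs(s - mean) + eps for s in scores]


def set_contrastive_loss(rewards, scores, weighted=True):
    """Sketch of a set-level contrastive loss over one prompt's responses.

    Responses scoring above the mean form the preferred bin (an assumed
    2-bin partition rule). The loss is
        -log( sum_{preferred} w_i e^{r_i} / sum_{all} w_i e^{r_i} ).
    """
    mean = sum(scores) / len(scores)
    weights = deviation_weights(scores) if weighted else [1.0] * len(scores)
    # Fold each weight into the logit: w * exp(r) == exp(r + log w).
    logits = [r + math.log(w) for r, w in zip(rewards, weights)]
    preferred = [l for l, s in zip(logits, scores) if s > mean]
    if not preferred:
        raise ValueError("no response scores above the mean; cannot partition")
    return logsumexp(logits) - logsumexp(preferred)


def dpo_loss(r_chosen, r_rejected):
    """Standard pairwise DPO loss: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))


# Four candidate responses for one prompt: implicit rewards and external scores.
rewards = [0.5, 0.1, -0.3, -0.4]
scores = [0.9, 0.6, 0.2, 0.1]
print(set_contrastive_loss(rewards, scores))
```

With exactly one preferred and one dispreferred response, the two deviation weights are equal and cancel, so this sketch reduces to the pairwise DPO loss, which is consistent with the abstract's description of MPO as a generalization of DPO.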
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11871