Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts

Taneesh Gupta; Rahul Madhavan; Xuchao Zhang; Nagarajan Natarajan; Chetan Bansal; Saravan Rajmohan

18 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Preference Optimization, Direct Preference Optimization, DPO, Multi-Preference Optimization, MPO, Policy Optimization, Reinforcement Learning from Human Feedback, RLHF

TL;DR: We introduce Multi-Preference Optimization (MPO), a method that generalizes DPO to learn efficiently from sets of preferred and dispreferred responses, achieving state-of-the-art performance in LLM post-training.

Abstract: Modern post-training pipelines for LLMs frequently use on-policy generation to produce multiple candidate responses per prompt. However, popular alignment methods such as Direct Preference Optimization (DPO) are restricted to pairwise comparisons and therefore discard valuable supervisory signal. For this setting, we propose Multi-Preference Optimization (MPO), a generalization of DPO that optimizes over entire sets of selected and rejected responses. This set-level contrastive approach is theoretically grounded: we first prove that leveraging $n$ responses achieves $\mathcal{O}\bigl(\tfrac{1}{\sqrt{n}}\bigr)$ convergence in total variation (TV) distance to the true preference distribution. We then prove, under a formal model with spacing-scaled Gaussian noise ($\Delta, \sigma = \mathcal{O}(1/n)$), that the reliability of MPO's 2-bin partition remains bounded away from zero, whereas the reliability of full-ranking methods degrades exponentially ($\exp(-\mathcal{O}(n))$). To further enhance learning, MPO employs deviation-based weighting, which emphasizes outlier responses and induces an implicit curriculum. Empirically, across multiple models and benchmarks, MPO achieves state-of-the-art performance, with an improvement of up to $\sim 17.5\%$ in win rate (WR) on AlpacaEval2 in the on-policy iterative setting, and state-of-the-art results in off-policy settings.
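
The abstract does not spell out the loss, so the following is only an illustrative sketch of what a set-level contrast with deviation-based weighting could look like, not the authors' implementation. It assumes that each response has a DPO-style implicit reward (the scaled log-ratio of policy to reference likelihood) and an external quality score, that the 2-bin partition is taken at the mean score, and that the deviation from the mean is added to each logit with a hypothetical strength `alpha`. The function names and the exact weighting form are assumptions made for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def set_contrast_loss(implicit_rewards, scores, alpha=0.0):
    """Illustrative set-level contrastive loss (assumed form, not from the paper).

    implicit_rewards: DPO-style values beta * (log pi(y|x) - log pi_ref(y|x)).
    scores: external quality scores used only to partition and weight responses.
    alpha: hypothetical strength of the deviation-based weighting.
    """
    mean_score = sum(scores) / len(scores)
    # Assumed weighting: responses far from the mean score get larger logits.
    deviations = [abs(s - mean_score) for s in scores]
    logits = [r + alpha * d for r, d in zip(implicit_rewards, deviations)]
    # Assumed 2-bin partition: "selected" means strictly above the mean score.
    selected = [l for l, s in zip(logits, scores) if s > mean_score]
    if not selected:
        raise ValueError("scores must not all be equal")
    # -log( sum over selected of exp(logit) / sum over all of exp(logit) )
    return logsumexp(logits) - logsumexp(selected)


def dpo_pair_loss(reward_chosen, reward_rejected):
    """Standard pairwise DPO loss: -log sigmoid(r_chosen - r_rejected)."""
    return math.log1p(math.exp(-(reward_chosen - reward_rejected)))
```

Under these assumptions, a set with one selected and one rejected response reduces to the pairwise DPO loss, which is the sense in which a set-level contrast generalizes DPO; for example, `set_contrast_loss([0.3, -0.2], [1.0, 0.0])` and `dpo_pair_loss(0.3, -0.2)` agree up to floating-point error.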

Supplementary Material: zip

Primary Area: foundation or frontier models, including LLMs

Submission Number: 11871
