MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

ICLR 2026 Conference Submission 22201 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Direct Policy Optimization, Large Language Models
TL;DR: MASS-DPO casts multi-negative DPO as a D-optimal design problem to actively pick a small, maximally informative negative set, improving alignment efficiency and accuracy with theory-backed guarantees and empirical gains.
Abstract: Multi-negative preference optimization under the Plackett–Luce (PL) model extends Direct Preference Optimization (DPO) by leveraging comparative signals across one preferred and multiple rejected responses. However, optimizing over large pools of negatives is computationally prohibitive, and many candidates contribute redundant gradients because they affect policy updates in similar ways. To address this, we introduce \textbf{MASS-DPO}, which derives the Fisher information matrix directly from the PL objective and shows that selecting negatives naturally reduces to a D-optimal design problem. This formulation guarantees maximal informativeness and comprehensive coverage of the current policy’s weaknesses. Moreover, the log-determinant criterion underlying D-optimal design is submodular, which we exploit through an incremental greedy algorithm that serves as the natural computational realization of D-optimality, combining scalability with theoretical rigor while resolving the combinatorial complexity of selecting a D-optimal negative set from large candidate pools. We establish convergence guarantees and finite-sample error bounds under this framework, and empirically demonstrate that MASS-DPO improves optimization efficiency and enhances downstream performance, achieving stronger alignment with substantially fewer negatives.
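
The submission page includes no code, but the selection step the abstract describes, greedy maximization of a log-determinant (D-optimal) criterion over candidate negatives, can be sketched concretely. The snippet below is a minimal illustration, not the authors' implementation: the function name `greedy_dopt_select`, the ridge regularizer, and the use of generic per-candidate feature vectors in place of the Fisher-information terms derived from the PL objective are assumptions made for the example.

```python
import numpy as np

def greedy_dopt_select(features: np.ndarray, k: int, ridge: float = 1e-3) -> list:
    """Greedily pick k rows of `features` to maximize
    log det(ridge * I + sum_{i in S} x_i x_i^T).

    The log-determinant objective is monotone submodular, so greedy
    selection carries the usual (1 - 1/e) approximation guarantee.
    `features` stands in for whatever per-negative information vectors
    the method derives; here they are arbitrary d-dimensional embeddings.
    """
    n, d = features.shape
    A_inv = np.eye(d) / ridge          # inverse of the current design matrix
    selected, remaining = [], set(range(n))

    for _ in range(min(k, n)):
        # Marginal gain of adding x is log(1 + x^T A^{-1} x); log is
        # monotone, so it suffices to maximize the quadratic form.
        gains = {i: features[i] @ A_inv @ features[i] for i in remaining}
        best = max(gains, key=gains.get)
        selected.append(best)
        remaining.remove(best)

        # Sherman–Morrison rank-one update of A^{-1} after adding x x^T.
        x = features[best]
        Ax = A_inv @ x
        A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)

    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pool = rng.normal(size=(200, 16))   # candidate negatives' information vectors
    print(greedy_dopt_select(pool, k=8))
```

Because each greedy step only needs the quadratic form x^T A^{-1} x and a rank-one inverse update, no full determinant is ever recomputed, which is what makes the incremental greedy realization of D-optimality scale to large candidate pools.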
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22201