Keywords: Preference Optimization, Model Alignment, Group Contrastive Loss, Multi-Preference Optimization, RLHF
TL;DR: We propose a Multi-Preference Optimization loss that jointly aligns models with human preferences using several responses per query, achieving state-of-the-art results.
Abstract: Direct Preference Optimization (DPO) has proven effective in aligning large language models with human preferences, but it is often constrained to pairwise comparisons, overlooking the additional positive and negative responses that are commonly available in real-world settings. We propose _Simultaneous Weighted Preference Optimization_ (SWEPO), which incorporates multiple responses per query and prioritizes those that deviate most from the average reward. This deviation-based weighting focuses training on the most informative outliers, akin to a built-in curriculum. Theoretically, we prove that such multi-preference sampling lowers alignment bias, bounding the expected deviation from the true acceptable-response distribution at a rate of $O(1/\sqrt{k})$, where $k$ is the number of responses per query. Empirically, SWEPO outperforms state-of-the-art baselines on the UltraFeedback dataset and demonstrates substantial improvements over DPO and InfoNCA, yielding gains of up to ~4% in length-controlled win rate on AlpacaEval.
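To make the deviation-based weighting described in the abstract concrete, here is a minimal sketch of how a weighted group contrastive loss over multiple responses might look. The function name `swepo_style_loss`, the weighting scheme (absolute deviation from the mean reward), and the temperature `beta` are illustrative assumptions for a single query, not the paper's exact formulation.

```python
import torch

def swepo_style_loss(policy_logps, ref_logps, rewards, beta=0.1):
    """Illustrative multi-preference loss over k responses to one query.

    policy_logps, ref_logps, rewards: tensors of shape (k,).
    Responses with above-average reward are treated as positives, the rest
    as negatives; each is weighted by its deviation from the mean reward.
    This is a sketch under assumed conventions, not the paper's exact loss.
    """
    # DPO-style implicit reward: scaled log-ratio of policy vs. reference.
    logits = beta * (policy_logps - ref_logps)

    # Deviation of each response's reward from the group mean; larger
    # deviation means the response contributes more to the objective.
    deviation = rewards - rewards.mean()
    weights = deviation.abs()
    is_positive = deviation > 0  # assumes rewards are not all identical

    # Weighted group contrastive objective: push probability mass toward
    # above-average responses relative to the whole response group.
    weighted_logits = logits + torch.log(weights + 1e-8)
    log_denominator = torch.logsumexp(weighted_logits, dim=0)
    loss = -(weighted_logits[is_positive] - log_denominator).mean()
    return loss
```

In this sketch, responses far above the mean reward dominate the numerator and responses far below it dominate the denominator, so the gradient concentrates on the most informative outliers, which is the curriculum-like effect the abstract describes.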
Submission Type: Long Paper (9 Pages)
Archival Option: This is a non-archival submission
Submission Number: 25