Abstract: Large language models (LLMs) fine-tuned with alignment methods, such as reinforcement learning from human feedback, have been used to develop some of the most capable AI systems to date. Despite their success, existing methods typically rely on simple binary labels, such as those indicating the preferred output in a pairwise preference. This overlooks the varying degrees of relative quality between pairs and prevents models from capturing these subtleties. To address this limitation, we consider settings in which this information (i.e., the margin) can be derived and propose a straightforward generalization of the optimization objectives commonly used in alignment methods. The approach, which we call Margin Matching Preference Optimization (MMPO), integrates per-feedback margins into the optimization, making it more robust to overfitting and yielding better LLM policies and reward models. Specifically, given quality margins in pairwise preferences, we design soft target probabilities based on the Bradley-Terry model, which are then used to train models with the standard cross-entropy objective. Our experiments with both human and AI feedback data demonstrate that MMPO can outperform baseline methods, often by a substantial margin, on popular benchmarks including MT-bench and RewardBench. Notably, the 7B model trained with MMPO achieves state-of-the-art performance on RewardBench compared to competing models at the same scale, as of June 2024. Our analysis further shows that MMPO is more robust to overfitting, leading to better-calibrated models.
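To make the core idea concrete, the sketch below is a minimal, hypothetical PyTorch rendering of a margin-aware DPO-style objective consistent with the abstract: the hard label of 1 for the preferred response is replaced by a Bradley-Terry soft target derived from the quality margin, and the standard cross-entropy is applied. It is not the paper's released implementation; the function names, the `alpha` scaling of the margin, and the assumption that the margin is a scalar score difference are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mmpo_soft_target(margin: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    # Bradley-Terry soft target: probability that the preferred response
    # beats the dispreferred one, given their quality margin (assumed here
    # to be a scalar score difference; `alpha` is an illustrative scale).
    return torch.sigmoid(alpha * margin)

def mmpo_dpo_loss(policy_chosen_logps: torch.Tensor,
                  policy_rejected_logps: torch.Tensor,
                  ref_chosen_logps: torch.Tensor,
                  ref_rejected_logps: torch.Tensor,
                  margin: torch.Tensor,
                  beta: float = 0.1,
                  alpha: float = 1.0) -> torch.Tensor:
    # Implicit preference logit under a DPO-style parameterization:
    # difference of policy-vs-reference log-ratios for chosen and rejected.
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    # Soft target from the Bradley-Terry model instead of a hard label of 1.
    target = mmpo_soft_target(margin, alpha)
    # Standard cross-entropy between the model's preference probability
    # (sigmoid of `logits`) and the soft target.
    return F.binary_cross_entropy_with_logits(logits, target)
```

A zero margin yields a target of 0.5 (no preference), recovering an uninformative label, while large margins approach the usual hard label of 1; the same soft-target substitution applies to reward-model training with a pairwise cross-entropy loss.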
Paper Type: Long
Research Area: Generation
Research Area Keywords: text-to-text generation, domain adaptation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 143