Keywords: Preference Alignment; DPO; Curriculum Learning
TL;DR: We propose a simple method that improves LLM alignment by sorting preference pairs by difficulty and applying SFT to hard cases instead of discarding them.
Abstract: Preference learning constitutes a fundamental component in aligning large language models (LLMs) with human values and ethical expectations, where the quality of preference data plays a critical role. Existing methods typically assess data quality by measuring the margin between preferred and dispreferred responses in each pair. Following the common intuition that small-margin (i.e., difficult) pairs are uninformative or even noisy, such pairs are often discarded. In this work, we challenge this common practice and propose a new insight: “While difficult pairs may hinder alignment when optimized with preference-based objectives due to potential likelihood displacement, they can still provide valuable learning signals when trained with supervised fine-tuning (SFT).” We empirically validate this insight through systematic experiments and highlight two key findings: (1) Structuring training from easy to difficult samples improves alignment performance, consistent with the curriculum learning paradigm; (2) Difficult pairs negatively impact preference-based optimization but become useful when optimized with an SFT loss. Based on this insight, we introduce a simple yet effective method, MixDPO, which ranks preference pairs by difficulty and dynamically switches to an SFT loss for difficult pairs. Our approach achieves improved alignment performance on the AlpacaEval 2 benchmark, outperforming existing DPO variants, particularly on the length-controlled (LC) win rate.
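To make the loss-switching idea concrete, below is a minimal PyTorch sketch of the kind of mixed objective the abstract describes: apply the DPO loss to easy pairs and an SFT (chosen-response likelihood) loss to difficult ones. The function name `mixdpo_loss`, the margin-based difficulty score, and the `hard_fraction` threshold are illustrative assumptions, not the authors' implementation; the curriculum ordering (easy-to-difficult) would additionally be handled at the data-loading level.

```python
import torch
import torch.nn.functional as F


def mixdpo_loss(
    policy_chosen_logps: torch.Tensor,    # (B,) sum log p_theta(chosen | prompt)
    policy_rejected_logps: torch.Tensor,  # (B,) sum log p_theta(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # (B,) same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,     # (B,)
    beta: float = 0.1,
    hard_fraction: float = 0.3,           # assumed: fraction of the batch treated as "difficult"
) -> torch.Tensor:
    """DPO loss on easy pairs, SFT (NLL on the chosen response) on hard pairs."""
    # Difficulty proxy: the log-likelihood margin under the reference model.
    # A small margin means the pair is hard to distinguish.
    margin = ref_chosen_logps - ref_rejected_logps

    # Flag the lowest-margin pairs in the batch as difficult.
    k = max(1, int(hard_fraction * margin.numel()))
    hard_idx = torch.topk(-margin, k).indices
    is_hard = torch.zeros_like(margin, dtype=torch.bool)
    is_hard[hard_idx] = True

    # Standard DPO loss for the easy pairs.
    logits = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )
    dpo = -F.logsigmoid(logits)

    # SFT loss for the hard pairs: maximize likelihood of the chosen response only.
    sft = -policy_chosen_logps

    return torch.where(is_hard, sft, dpo).mean()
```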
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14129