Keywords: Preference Optimization, RLHF, LLM Alignment, Pairwise Learning to Rank, Reward Modeling
TL;DR: GapPO weights preference optimization gradients by annotation score gaps, concentrating learning on unambiguous comparisons and improving pairwise ranking accuracy across model families and datasets.
Abstract: Aligning large language models to human preferences requires training on pairwise comparisons between candidate responses. Existing preference optimization methods assign equal gradient weight to every pair, regardless of whether the quality difference is large or negligibly small: a pair scoring $4.8$ vs. $1.2$ receives the same gradient weight as one scoring $3.2$ vs. $2.9$, diluting clear signal with annotator noise. We introduce GapPO (Gradient-Adaptive Pairwise Preference Optimization), a preference optimization method designed to directly improve pairwise ranking accuracy in large language models. GapPO weights each pair by the absolute quality-score gap $|\delta| = |\texttt{score}_{\text{chosen}} - \texttt{score}_{\text{rejected}}|$, so that gradient mass concentrates on the most discriminative comparisons. Because the model is shaped more by reliable comparisons, its implicit reward function better separates high-quality from low-quality responses at test time. Beyond improving pairwise accuracy (PWA), score-gap weighting improves the Spearman rank correlation between model rewards and annotation scores, which is the calibration property required to scale from pairwise to listwise ranking. Evaluated on the binarized UltraFeedback dataset across Qwen2.5-0.5B, Gemma-2-2B, and Mistral-7B, GapPO consistently outperforms the SimPO, CPO, IPO, and AlphaPO baselines.
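The score-gap weighting described in the abstract can be illustrated with a small sketch. The abstract does not give the exact objective, so the code below makes several assumptions: a plain logistic pairwise loss on the reward margin, weights normalized to sum to one over the batch, and the hypothetical helper names `gap_weighted_loss` and `pairwise_accuracy`. It should be read as one plausible instance of gap weighting, not as the authors' implementation.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gap_weighted_loss(margins, chosen_scores, rejected_scores, eps=1e-8):
    """Logistic pairwise loss with each pair weighted by its score gap.

    margins[i] is the model's implicit reward margin for pair i,
    r(chosen) - r(rejected). Batch normalization of the weights is an
    assumption of this sketch, not something stated in the abstract.
    """
    # |delta| = |score_chosen - score_rejected| for every pair
    gaps = [abs(c - r) for c, r in zip(chosen_scores, rejected_scores)]
    total = sum(gaps) + eps
    weights = [g / total for g in gaps]
    # -log(sigmoid(m)) equals softplus(-m)
    per_pair = [softplus(-m) for m in margins]
    loss = sum(w * l for w, l in zip(weights, per_pair))
    return loss, weights


def pairwise_accuracy(margins):
    """Fraction of pairs where the chosen response gets the higher reward."""
    return sum(1 for m in margins if m > 0) / len(margins)


# The two pairs from the abstract: 4.8 vs. 1.2 and 3.2 vs. 2.9.
chosen = [4.8, 3.2]
rejected = [1.2, 2.9]
margins = [0.5, -0.1]  # illustrative reward margins, not real model outputs

loss, weights = gap_weighted_loss(margins, chosen, rejected)
pwa = pairwise_accuracy(margins)
```

Under these assumptions the clear pair (gap 3.6) receives roughly 92% of the weight and the near-tie (gap 0.3) roughly 8%, whereas uniform weighting would give each pair 50%.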
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 100