Keywords: Direct Preference Optimization (DPO), Gradient Imbalance, Length Bias Mitigation, Offline Preference Optimization, Adaptive Loss Function
TL;DR: AdaDPO corrects a structural gradient imbalance in DPO by using stop-gradient coefficients to rebalance updates on preferred versus dispreferred responses. It is a four-line, drop-in change to the loss code and improves length-controlled win rate.
Abstract: Direct Preference Optimization (DPO) is an offline decision-making algorithm: a policy is trained entirely from a fixed preference dataset, with no online interaction. Recent theoretical analysis uncovers an asymmetric gradient pathology in DPO that misallocates this offline learning signal: the loss suppresses dispreferred responses substantially faster than it promotes preferred ones, so the model predominantly learns to avoid bad answers rather than to generate good ones. We propose AdaDPO, a self-adaptive variant of DPO that introduces a stop-gradient coefficient for each preference pair, derived directly from the policy model's generation probabilities. AdaDPO is constructed to balance the gradient magnitudes on the preferred and dispreferred response probabilities; the practical implementation balances per-token gradients and applies a numerical clipping bound for stability, while retaining DPO's original hyperparameter structure. In preliminary experiments with Llama-3-8B-Instruct trained on UltraFeedback, AdaDPO outperforms DPO on AlpacaEval 2 in most settings: it achieves a higher length-controlled win rate (LC) in 81% of hyperparameter combinations and enlarges the margin of LC over raw win rate (WR) in 88% of combinations, indicating effective mitigation of length bias. Because it operates purely at the loss level, AdaDPO is a drop-in correction for preference-based alignment pipelines.
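The imbalance described in the abstract can be illustrated with the standard DPO loss for a single preference pair. The sketch below is not the AdaDPO loss, whose coefficients are not specified in the abstract; it only assumes the usual DPO objective written in terms of sequence probabilities, and the function names and example probabilities are hypothetical. Under that assumption, the partial derivatives with respect to the preferred and dispreferred probabilities differ in magnitude by the factor p_w / p_l, so the dispreferred side receives the larger update whenever its probability is the smaller one.

```python
import math


def dpo_margin(p_w, p_l, ref_w, ref_l, beta):
    """Scaled log-ratio margin used inside the standard DPO loss."""
    return beta * ((math.log(p_w) - math.log(ref_w))
                   - (math.log(p_l) - math.log(ref_l)))


def dpo_loss(p_w, p_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss, -log(sigmoid(margin)), for one preference pair."""
    margin = dpo_margin(p_w, p_l, ref_w, ref_l, beta)
    return math.log1p(math.exp(-margin))


def dpo_grads(p_w, p_l, ref_w, ref_l, beta=0.1):
    """Analytic partial derivatives of the loss w.r.t. p_w and p_l."""
    margin = dpo_margin(p_w, p_l, ref_w, ref_l, beta)
    weight = beta / (1.0 + math.exp(margin))  # beta * sigmoid(-margin)
    grad_w = -weight / p_w  # negative: descent raises the preferred probability
    grad_l = weight / p_l   # positive: descent lowers the dispreferred probability
    return grad_w, grad_l


# Hypothetical probabilities, chosen only for illustration.
p_w, p_l, ref_w, ref_l = 0.2, 0.05, 0.1, 0.1
grad_w, grad_l = dpo_grads(p_w, p_l, ref_w, ref_l)
print("gradient on preferred probability:   ", grad_w)
print("gradient on dispreferred probability:", grad_l)
print("magnitude ratio (dispreferred / preferred):", abs(grad_l) / abs(grad_w))
```

A per-pair coefficient that is treated as a constant during backpropagation (a stop-gradient) could rescale one of the two terms to offset this ratio; the exact form AdaDPO uses, including its per-token balancing and clipping bound, is left to the paper.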
Submission Number: 109