AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization

Published: 01 May 2025 · Last Modified: 23 Jul 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Aligning large language models (LLMs) with human preferences requires balancing policy optimization with computational stability. While recent offline methods like DPO and SimPO bypass reinforcement learning’s complexity, they face critical limitations: DPO relies on static reference models that degrade with policy updates, and SimPO assumes a uniform target reward margin that ignores instance-wise preference strength. We propose AlphaDPO, an adaptive preference optimization framework that dynamically reparameterizes the reference distribution to address these issues. Our key innovation lies in an implicit reference model \(\hat{\pi}_{\text{ref}} \propto U(y|x)(\pi_\theta/\pi_{\text{ref}})^\alpha\), which interpolates between policy-driven specialization and uniform exploration while enabling instance-adaptive reward margins. Theoretically, we prove AlphaDPO implicitly controls sequential KL divergence between iterative policy updates, ensuring stability even with poorly calibrated reference models. Empirically, AlphaDPO achieves state-of-the-art performance on AlpacaEval 2 (58.7\% LC win rate) and Arena-Hard (35.7\% win rate) across Mistral2-7B, Llama3-8B, and Gemma2-9B, demonstrating robust alignment without multi-stage training. Our work establishes adaptive reference reparameterization as a principled mechanism for preference optimization.
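For readers who want to see how the adaptive margin emerges, the following is a minimal derivation sketch obtained by substituting the implicit reference \(\hat{\pi}_{\text{ref}}\) above into the standard DPO implicit reward \(r(x,y) = \beta \log\big(\pi_\theta(y|x)/\hat{\pi}_{\text{ref}}(y|x)\big)\); it is a reading aid only and omits any length normalization, stop-gradient, or margin-standardization details of the full method.
\[
r(x,y) = \beta \log \pi_\theta(y|x) - \beta\alpha\big[\log \pi_\theta(y|x) - \log \pi_{\text{ref}}(y|x)\big] + C(x),
\]
\[
\mathcal{L}_{\text{sketch}}(\theta) \approx -\,\mathbb{E}_{(x, y_w, y_l)}\Big[\log \sigma\Big(\beta\,\Delta_\theta - \beta\alpha\,(\Delta_\theta - \Delta_{\text{ref}})\Big)\Big],
\qquad
\Delta_\theta = \log \tfrac{\pi_\theta(y_w|x)}{\pi_\theta(y_l|x)},\;
\Delta_{\text{ref}} = \log \tfrac{\pi_{\text{ref}}(y_w|x)}{\pi_{\text{ref}}(y_l|x)}.
\]
Under this sketch, the term \(\beta\alpha(\Delta_\theta - \Delta_{\text{ref}})\) plays the role of the instance-adaptive reward margin: pairs on which the policy already deviates strongly from the reference incur a larger margin, while \(\alpha = 0\) recovers a uniform-reference, margin-free objective.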
Lay Summary: AI models, like chatbots, are being trained to better understand and follow human instructions and preferences. However, current training methods can sometimes be inflexible. Some rely on fixed 'guidance' that doesn't adapt as the AI learns new things, while others treat every piece of human feedback as equally important, potentially missing subtle differences in what users prefer. Our new method, AlphaDPO, acts like a more dynamic and personalized 'teacher' for these AI models. It continuously refines its teaching approach based on the AI's progress and also recognizes that some user preferences might be stronger or more critical to get right than others. This adaptive way of teaching helps the AI learn human values more effectively and reliably. As a result, AlphaDPO helps create AI assistants that are better aligned with user expectations and have shown top-level performance in evaluations.
Primary Area: Deep Learning->Large Language Models
Keywords: Direct Preference Optimization, LLM Alignment
Submission Number: 11970