BPO: Revisiting Preference Modeling in Direct Preference Optimization

18 Sept 2025 (modified: 28 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Models, Direct Preference Optimization
TL;DR: Enhance preference modeling for direct preference optimization by dynamically balancing the optimization of chosen and rejected responses
Abstract: Direct Preference Optimization (DPO) has emerged as a popular method for aligning LLMs with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through pairwise ranking losses, it often neglects absolute reward magnitudes. This oversight can decrease the likelihood of chosen responses and increase the risk of generating out-of-distribution responses, leading to poor performance. We term this issue Degraded Chosen Responses (DCR). To address it, we propose Balanced Preference Optimization (BPO), a novel framework that dynamically balances the optimization of chosen and rejected responses through two key components: a balanced reward margin and a gap adaptor. Unlike previous methods, BPO fundamentally resolves DPO's DCR issue without introducing additional constraints to the loss function. Experimental results on multiple mathematical reasoning tasks show that BPO significantly outperforms DPO, improving accuracy by +10.1% with Llama-3.1-8B-Instruct and +11.7% with Qwen2.5-Math-7B. It also surpasses other DPO variants, by +3.6% over IPO, +5.0% over SLiC, and +3.1% over Cal-DPO on the same model. Remarkably, our algorithm requires only a single line of code modification, making it simple to implement and fully compatible with existing DPO-based frameworks.
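The abstract does not give the exact BPO loss, so the following is only a minimal sketch of the DPO-family pairwise loss it modifies. The `gap` coefficient is a hypothetical stand-in for the paper's "gap adaptor" (setting `gap=1.0` recovers standard DPO); it is not the authors' actual formulation, just an illustration of how a one-line change to the margin can reweight the chosen and rejected terms.

```python
import torch
import torch.nn.functional as F

def dpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps,
                   beta=0.1, gap=1.0):
    """Pairwise preference loss in the DPO family.

    `gap` is a hypothetical balancing coefficient standing in for the
    abstract's "gap adaptor"; gap=1.0 reduces to standard DPO.
    """
    # Implicit rewards: log-ratio of policy to reference model.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Standard DPO maximizes margin = chosen_reward - rejected_reward,
    # which can be satisfied by pushing the chosen likelihood down.
    # Reweighting the rejected term is one way a "balanced" margin
    # could keep pressure on raising the chosen reward instead.
    margin = chosen_reward - gap * rejected_reward
    return -F.logsigmoid(margin).mean()

# Example usage with per-sequence log-probabilities (batch of 2):
if __name__ == "__main__":
    pc = torch.tensor([-12.0, -15.0])   # policy log p(chosen)
    pr = torch.tensor([-14.0, -13.0])   # policy log p(rejected)
    rc = torch.tensor([-13.0, -14.0])   # reference log p(chosen)
    rr = torch.tensor([-13.5, -13.5])   # reference log p(rejected)
    print(dpo_style_loss(pc, pr, rc, rr, beta=0.1, gap=1.0))
```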
Primary Area: reinforcement learning
Submission Number: 11430