DGPO: Mitigating Likelihood Displacement with Bidirectional KL Divergence Gap

Taihang Zhen; Fanyu Meng; Boyan Wang; Jiaheng Liu; Jing Huo; Yang Gao; Xi Yang; Chao Deng; Junlan Feng

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0

Keywords: Alignment, Likelihood Displacement, Gradient Entanglement

Abstract: Margin-based alignment methods, represented by Direct Preference Optimization (DPO), aim to widen the margin between chosen and rejected responses. However, prior work points out that the log-probability of the chosen response often decreases during training, which lowers the likelihood of generating it. This likelihood displacement, caused by gradient entanglement, is a failure mode of preference optimization that has not been fully resolved. In this paper, we use the forward and reverse Kullback-Leibler (KL) divergences over the probability distributions of preference pairs to construct Divergence Gap Preference Optimization (DGPO). We prove that DGPO promotes an increase in the chosen log-probability. DGPO is also lightweight and automatic in real-world alignment: it requires neither a reference model nor additional key hyperparameters. Experiments on downstream tasks show that DGPO remains competitive across various mainstream benchmarks.
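To make the failure mode concrete, the following minimal sketch evaluates the standard DPO loss on one preference pair before and after a hypothetical update. It is not the DGPO objective, which the abstract does not spell out. The function name `dpo_loss`, the value `beta=0.1`, and all log-probability values are assumptions chosen purely for illustration.

```python
import math


def dpo_loss(chosen_logp, rejected_logp, ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative helper)."""
    # Implicit reward margin: policy-vs-reference log-ratio of chosen minus rejected.
    margin = beta * ((chosen_logp - ref_chosen_logp) - (rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Hypothetical sequence log-probabilities under the reference model.
ref_chosen, ref_rejected = -10.0, -12.0

# Before training the policy equals the reference, so the margin is zero.
loss_before = dpo_loss(-10.0, -12.0, ref_chosen, ref_rejected)

# Hypothetical update: the chosen log-probability FALLS from -10 to -11,
# but the rejected one falls further, from -12 to -16.
loss_after = dpo_loss(-11.0, -16.0, ref_chosen, ref_rejected)

print(f"loss before: {loss_before:.4f}")
print(f"loss after:  {loss_after:.4f}")
```

In this toy case the loss goes down even though the chosen response has become less likely, because the loss depends only on the gap between the two log-ratios. This is the behavior the abstract refers to as likelihood displacement.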

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 13993
