Keywords: reinforcement learning, DPO, LPO
TL;DR: LPO tackles DPO's overfitting and collapse via gradient decoupling, enhanced stability, and tunable rejection suppression.
Abstract: Direct Preference Optimization (DPO) is a widely adopted offline preference optimization algorithm, valued for its simplicity and training stability; however, it is susceptible to overfitting and performance collapse. To overcome these limitations, we introduce Linear Preference Optimization (LPO), a novel alignment framework with three key innovations. First, we achieve gradient decoupling by replacing the log-sigmoid function with an absolute-difference loss, isolating the optimization dynamics of the chosen and rejected responses. Second, we enhance training stability by incorporating an offset constraint and a positive regularization term, ensuring consistent response quality. Third, we implement controllable rejection suppression through gradient separation, which admits a straightforward estimation procedure and a tunable coefficient that regulates how quickly the rejection probability decreases. Extensive experiments demonstrate that LPO consistently outperforms DPO across diverse tasks, including general text processing, mathematics, text-to-speech (TTS), and automatic speech recognition (ASR). These findings establish LPO as a robust, versatile, and tunable paradigm for preference alignment.
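For concreteness, the sketch below shows one way the three ingredients described in the abstract (an absolute-difference loss in place of DPO's log-sigmoid, an offset constraint with positive regularization, and a tunable coefficient on the rejected side) could be assembled in a DPO-style implicit-reward setup. The function name, the specific forms of the offset and regularization terms, and the hyperparameters `beta`, `offset`, `lambda_pos`, and `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def lpo_style_loss_sketch(policy_chosen_logps, policy_rejected_logps,
                          ref_chosen_logps, ref_rejected_logps,
                          beta=0.1, offset=0.0, lambda_pos=0.1, alpha=1.0):
    """Illustrative LPO-style objective (a sketch, not the paper's exact loss).

    Inputs are per-example sums of token log-probabilities under the policy
    and the frozen reference model for the chosen and rejected responses.
    """
    # DPO-style implicit rewards relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Gradient separation (assumed form): scale the gradient flowing through the
    # rejected term by alpha while keeping its value unchanged, so alpha controls
    # how fast the rejection probability is pushed down.
    rejected_term = alpha * rejected_rewards + (1.0 - alpha) * rejected_rewards.detach()

    # Absolute-difference margin loss with an offset constraint, replacing
    # DPO's -logsigmoid(chosen_rewards - rejected_rewards).
    margin_loss = torch.abs(chosen_rewards - rejected_term - offset)

    # Positive regularization (assumed form): penalize the chosen response's
    # log-probability falling below that of the reference model.
    pos_reg = F.relu(ref_chosen_logps - policy_chosen_logps)

    return (margin_loss + lambda_pos * pos_reg).mean()
```

In this sketch the absolute-difference term keeps the gradient magnitude constant rather than saturating like the log-sigmoid, which is one plausible reading of "gradient decoupling"; how closely this matches the paper's construction should be checked against the full text.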
Primary Area: reinforcement learning
Submission Number: 11256