Keywords: reinforcement learning, DPO, LPO
TL;DR: LPO tackles DPO's overfitting and collapse via gradient decoupling, enhanced stability, and tunable rejection suppression.
Abstract: Direct Preference Optimization (DPO) is a widely adopted offline preference optimization algorithm, valued for its simplicity and training stability; however, it is susceptible to overfitting and performance collapse. To overcome these limitations, we introduce Linear Preference Optimization (LPO), a novel alignment framework with three key innovations. First, we achieve gradient decoupling by replacing the log-sigmoid function with an absolute-difference loss, isolating the optimization dynamics of the chosen and rejected responses. Second, we enhance training stability by incorporating an offset constraint and a positive regularization term, ensuring consistent response quality. Third, we implement controllable rejection suppression through gradient separation, which admits a straightforward estimation procedure and a tunable coefficient that regulates how quickly the rejection probability decreases. Extensive experiments demonstrate that LPO consistently outperforms DPO across diverse tasks, including general text processing, mathematics, text-to-speech (TTS), and automatic speech recognition (ASR). These findings establish LPO as a robust, versatile, and tunable paradigm for preference alignment.
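For concreteness, the sketch below shows one way the three ingredients described in the abstract (an absolute-difference loss in place of DPO's log-sigmoid, an offset constraint with positive regularization, and a tunable coefficient on the rejected side) could be assembled in a DPO-style implicit-reward setup. The function name, the specific forms of the offset and regularization terms, and the hyperparameters `beta`, `offset`, `lambda_pos`, and `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def lpo_style_loss_sketch(policy_chosen_logps, policy_rejected_logps,
                          ref_chosen_logps, ref_rejected_logps,
                          beta=0.1, offset=0.0, lambda_pos=0.1, alpha=1.0):
    """Illustrative LPO-style objective (a sketch, not the paper's exact loss).

    Inputs are per-example sums of token log-probabilities under the policy
    and the frozen reference model for the chosen and rejected responses.
    """
    # DPO-style implicit rewards relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Gradient separation (assumed form): scale the gradient flowing through the
    # rejected term by alpha while keeping its value unchanged, so alpha controls
    # how fast the rejection probability is pushed down.
    rejected_term = alpha * rejected_rewards + (1.0 - alpha) * rejected_rewards.detach()

    # Absolute-difference margin loss with an offset constraint, replacing
    # DPO's -logsigmoid(chosen_rewards - rejected_rewards).
    margin_loss = torch.abs(chosen_rewards - rejected_term - offset)

    # Positive regularization (assumed form): penalize the chosen response's
    # log-probability falling below that of the reference model.
    pos_reg = F.relu(ref_chosen_logps - policy_chosen_logps)

    return (margin_loss + lambda_pos * pos_reg).mean()
```

In this sketch the absolute-difference term keeps the gradient magnitude constant rather than saturating like the log-sigmoid, which is one plausible reading of "gradient decoupling"; how closely this matches the paper's construction should be checked against the full text.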
Primary Area: reinforcement learning
Submission Number: 11256