Keywords: Alignment, DPO, RLHF, IPO
TL;DR: Learning preferences is more important than imitation.
Abstract: Direct Preference Optimization (DPO; \citet{rafailov2023direct}) is a widely used method for aligning large language models (LLMs) with human feedback. However, its objective often leads to reward hacking, where the model suppresses the probabilities of both preferred responses ($y_w$) and dispreferred responses ($y_l$). In such cases, the model fails to capture the underlying preference signal, effectively treating both outcomes as undesirable. Subsequent methods, while addressing this issue, introduce their own shortcomings. Kahneman-Tversky Optimization (KTO; \citet{ethayarajh2024kto}) overly pursues imitation of preferred examples at the expense of learning the true preference margin. Meanwhile, the symmetric squared loss in Identity Policy Optimization (IPO; \citet{azar2023general}) fails to distinguish between a large positive log-probability difference (indicating a correctly learned preference) and a large negative one (indicating a pathologically inverted preference), penalizing both extremes equally.

To address this cascade of challenges, we propose Stable Preference Optimization (SPO). At its core is a novel loss function designed to: 1) prevent reward hacking by establishing a stable, finite optimization target; 2) focus on the preference margin rather than pure imitation, balancing learning across responses; and 3) correct IPO's loss imbalance with an asymmetric design based on the function $f(z) = -z e^{-z}$. Under this loss, when the log-probability margin rises above its initial value, the loss falls below its initial value; when the margin drops below its initial value, the loss rises above it; and the function admits a unique, finite minimum.

Our method provides a unified solution to the core drawbacks of DPO, KTO, and IPO. Experimental results demonstrate significant improvements in both alignment performance and training stability.
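A minimal sketch of how such an objective could be implemented, assuming $f(z) = -z e^{-z}$ is applied to the $\beta$-scaled, DPO-style log-probability margin between $y_w$ and $y_l$; the function name `spo_loss`, the default $\beta$, and the exact form of the margin are illustrative assumptions inferred from the abstract, not the paper's definitive formulation.

```python
# Hypothetical sketch of an SPO-style loss, inferred from the abstract only.
import torch


def spo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Apply f(z) = -z * exp(-z) to an (assumed) beta-scaled preference margin."""
    # DPO-style implicit-reward margin between preferred (y_w) and dispreferred (y_l).
    margin = (policy_chosen_logps - ref_chosen_logps) - (
        policy_rejected_logps - ref_rejected_logps
    )
    z = beta * margin
    # Asymmetric shape: the loss decreases as the margin grows past its starting
    # point, increases when the margin turns negative, and has a single finite minimum,
    # so the optimizer is not rewarded for driving both responses' probabilities down.
    return (-z * torch.exp(-z)).mean()


if __name__ == "__main__":
    # Toy usage with random per-example sequence log-probabilities.
    n = 4
    policy_w = torch.randn(n, requires_grad=True)
    policy_l = torch.randn(n, requires_grad=True)
    ref_w, ref_l = torch.randn(n), torch.randn(n)
    loss = spo_loss(policy_w, policy_l, ref_w, ref_l)
    loss.backward()
    print(float(loss), policy_w.grad)
```

Unlike the logistic DPO loss, this shape bounds how much can be gained by inflating an already-positive margin, while still penalizing inverted preferences more heavily than it rewards correct ones.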
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 16253