Keywords: preference optimization, stable, large language model
TL;DR: We introduce StaPO, a two-sided contrastive objective that bounds the contrastive logit with dual margins, preventing the instability seen in existing methods.
Abstract: Offline preference optimization has proven effective for aligning large language models (LLMs). However, existing methods often suffer from objective misalignment, which drives models toward degenerate language patterns (i.e., nonsensical tokens and incoherent phrases) after even moderately extended fine-tuning. In this paper, we propose Stable Preference Optimization (StaPO), a novel method designed to address this challenge. We first unify existing offline preference optimization approaches under a one-sided contrastive (OsC) learning framework, showing that OsC inherently maximizes the contrastive logit—the average or summed log-probability difference between preferred and dispreferred responses—without proper constraints. This unconstrained maximization of the contrastive logit can gradually erode the LLM's core linguistic functionality. StaPO mitigates this via a two-sided contrastive (TsC) learning framework with dual-margin constraints. The left margin, akin to that in OsC-based methods, ensures effective preference learning, while the right margin limits excessive growth of the contrastive logit, thereby preventing the collapse of the well-trained language system. Empirical evaluations conducted on standard benchmarks, such as AlpacaEval 2, Arena-Hard, and MT-Bench, highlight significant improvements achieved by StaPO compared to OsC-based methods. While StaPO consistently maintains stable win rates and entropy levels across multiple fine-tuning epochs, OsC-based methods show abnormally increasing or decreasing language entropy and deteriorating performance. These benefits of StaPO are consistently observed across diverse model architectures, including both base and instruction-tuned models such as Mistral (7B) and Llama 3 (8B).
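The dual-margin idea described above can be illustrated with a minimal sketch. This is an assumption about the objective's shape, not the paper's exact formulation: it treats the contrastive logit as the difference in log-probabilities between the preferred and dispreferred responses, and applies hinge penalties on both sides, where the margin values `m_left` and `m_right` are hypothetical placeholders.

```python
import numpy as np

def tsc_loss(logp_chosen, logp_rejected, m_left=0.5, m_right=5.0):
    """Hypothetical two-sided hinge on the contrastive logit.

    logp_chosen / logp_rejected: per-example (summed or averaged)
    log-probabilities of the preferred and dispreferred responses
    under the policy. m_left and m_right are illustrative margins.
    """
    z = logp_chosen - logp_rejected           # contrastive logit
    left = np.maximum(0.0, m_left - z)        # left margin: keep preference learning effective
    right = np.maximum(0.0, z - m_right)      # right margin: cap growth of the contrastive logit
    return float((left + right).mean())
```

Under this sketch, a contrastive logit inside the band [m_left, m_right] incurs zero loss, so the gradient stops pushing the logit upward once preferences are sufficiently separated; OsC-style objectives, by contrast, keep rewarding larger separation indefinitely.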
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11993