Keywords: Large Language Model, Fine-tuning, Self-play
TL;DR: Our paper introduces Self-Augmented Preference Optimization (SAPO), a dynamic, scalable training paradigm that outperforms traditional methods by autonomously generating negative responses and integrating real-time data updates.
Abstract: Traditional language model alignment methods, such as Direct Preference Optimization (DPO), are limited by their dependence on static, pre-collected paired preference data, which restricts their adaptability and practical applicability. To address this limitation, we introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that requires no pre-existing paired data. Building on the self-play concept, in which the model autonomously generates its own negative responses, we further incorporate an off-policy learning pipeline to improve data exploration and exploitation. Specifically, we employ an Exponential Moving Average (EMA) model along with a replay buffer to enable dynamic updates of response segments, effectively integrating real-time feedback with historical data insights. Our comprehensive evaluations of the LLaMA3-8B and Mistral-7B models across benchmarks—including the Open LLM Leaderboard, IFEval, AlpacaEval 2.0, and MT-Bench—demonstrate that SAPO matches or surpasses established offline contrastive baselines, such as DPO and Odds Ratio Preference Optimization (ORPO), and outperforms offline self-play methods like SPIN.
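To make the EMA-plus-replay-buffer mechanism described in the abstract concrete, below is a minimal sketch in PyTorch. All names here (`ema_update`, `ReplayBuffer`, the capacity and decay values) are illustrative assumptions for exposition, not the authors' released implementation.

```python
import random
from collections import deque

import torch


@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module,
               decay: float = 0.999) -> None:
    """EMA update of the reference model: ema <- decay * ema + (1 - decay) * current.

    The slowly-moving EMA model is the one that generates negative (rejected)
    responses, so the negatives lag the policy and stay informative.
    """
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)


class ReplayBuffer:
    """Fixed-size FIFO store of self-generated (prompt, rejected_response) pairs,
    mixing historical negatives into each preference batch (off-policy reuse)."""

    def __init__(self, capacity: int = 10_000):
        self._buf = deque(maxlen=capacity)

    def push(self, prompt: str, rejected_response: str) -> None:
        self._buf.append((prompt, rejected_response))

    def sample(self, batch_size: int):
        # Uniform sampling over stored pairs; returns fewer if buffer is small.
        return random.sample(list(self._buf), min(batch_size, len(self._buf)))


# Conceptual per-step flow under these assumptions:
#   1. The EMA model generates a rejected response segment for each prompt.
#   2. Each (prompt, rejected) pair is pushed into the replay buffer.
#   3. A preference batch mixes fresh pairs with buffer samples, pairing
#      gold supervised responses (chosen) against generated ones (rejected),
#      and feeds them to a contrastive preference loss such as DPO/ORPO.
#   4. After the gradient step, ema_update() refreshes the EMA model.
```

This is a sketch of the general pattern only; details such as how response segments are selected and how the buffer is mixed with fresh data follow the paper's method section.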
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8397