Keywords: Large Language Models, Alignment, Nash Equilibrium
TL;DR: We propose a novel framework for self-play alignment with regularization.
Abstract: Self-play alignment has emerged as an effective approach for fine-tuning large language models (LLMs), formulating preference optimization as a two-player game. However, regularization with respect to the reference policy, which is crucial for mitigating over-optimization, has been insufficiently investigated in self-play alignment. To study the impact of different regularization strategies, we propose **Regularized Self-Play Policy Optimization (RSPO)**, a novel framework that unifies prior methods and enables simple plug-and-play regularizers while preserving convergence to the Nash equilibrium of the corresponding regularized game. We observe that RSPO with appropriate regularizers can substantially improve the length-controlled win rate (LCWR) on AlpacaEval-2 across a range of base models, while also achieving consistently superior performance on Arena-Hard, MT-Bench, and ArmoRM, as well as greater response diversity. In particular, RSPO improves the unregularized self-play baseline (SPPO) on AlpacaEval-2 LCWR from $28.5\%$ to $35.4\%$ with base model Mistral-7B, from $38.77\%$ to $43.66\%$ with LLaMA-8B, and from $50.54\%$ to $51.83\%$ with Gemma-2B. Combining simplicity, convergence guarantees, and significant empirical gains, RSPO offers a strong foundation for exploring regularized self-play in language model alignment.
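To illustrate the plug-and-play idea described above, here is a minimal, hypothetical sketch of how a regularizer toward the reference policy could be added to an SPPO-style self-play loss. The function name, tensor shapes, the squared-loss form of the self-play term, and the specific reverse-KL surrogate are assumptions for illustration only, not the paper's exact objective.

```python
import torch

def rspo_style_loss(logp_policy, logp_ref, pref_prob, eta=1.0, lam=0.1):
    """Sketch of a regularized self-play objective (assumed form).

    logp_policy: log pi_theta(y|x) for sampled responses, shape (B,)
    logp_ref:    log pi_ref(y|x) for the same responses, shape (B,)
    pref_prob:   estimated preference P(y beats current policy | x), shape (B,)
    eta:         self-play step-size / inverse temperature (assumed)
    lam:         regularization strength (assumed)
    """
    log_ratio = logp_policy - logp_ref
    # Self-play term: drive the log-ratio toward eta * (preference - 1/2),
    # in the spirit of SPPO-style squared objectives (assumed form).
    selfplay = (log_ratio - eta * (pref_prob - 0.5)).pow(2).mean()
    # Plug-and-play regularizer: a simple reverse-KL surrogate toward pi_ref,
    # estimated on policy samples as E[log pi_theta - log pi_ref] (assumed choice);
    # other regularizers could be swapped in here.
    reg = log_ratio.mean()
    return selfplay + lam * reg

# Example usage with dummy tensors:
B = 4
loss = rspo_style_loss(torch.randn(B), torch.randn(B), torch.rand(B))
```

The point of the sketch is only that the regularizer enters as an additive term with its own coefficient, so different choices can be compared without changing the self-play machinery.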
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12928