Keywords: Self-play, LLM alignment, Game theory
TL;DR: This paper introduces a novel framework for applying different regularizations to self-play alignment methods.
Abstract: Self-play alignment algorithms have been developed as effective methods for fine-tuning large language models (LLMs), formulating preference optimization as a two-player game. However, regularization toward the reference policy, which is crucial for mitigating over-optimization, has been insufficiently investigated in self-play alignment. In this paper, we show that regularization can significantly improve unregularized self-play. To study the impact of different regularizers in self-play alignment, we propose Regularized Self-Play Policy Optimization (RSPO), a generalized framework that regularizes self-play by simply adding a chosen regularization term to the loss, while maintaining provable last-iterate convergence to the Nash equilibrium of the corresponding regularized game. Surprisingly, empirical evaluations using the Mistral-7B-Instruct base model reveal that forward KL divergence regularization reduces response length in RSPO, whereas reverse KL divergence markedly improves raw win rates. RSPO with a linear combination of forward and reverse KL divergence regularization substantially increases the length-controlled win rate on AlpacaEval-2, elevating the unregularized self-play alignment method (SPPO) from $28.53\%$ to $35.44\%$. Finally, we show that RSPO also improves response diversity.
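A minimal sketch of the regularized loss structure described above, assuming a generic unregularized self-play objective $\mathcal{L}_{\mathrm{SP}}$ (e.g., the SPPO loss) and hypothetical mixing coefficients $\alpha$ and $\beta$ that are not specified here, with the usual convention that reverse KL is taken from the policy to the reference:

$$\mathcal{L}_{\mathrm{RSPO}}(\pi_\theta) \;=\; \mathcal{L}_{\mathrm{SP}}(\pi_\theta) \;+\; \alpha\,\mathrm{KL}\!\left(\pi_{\mathrm{ref}} \,\|\, \pi_\theta\right) \;+\; \beta\,\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),$$

where the first KL term is the forward direction, the second the reverse direction, and setting $\beta = 0$ or $\alpha = 0$ recovers purely forward- or reverse-KL-regularized self-play, respectively.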
Submission Number: 26