Keywords: large language model, RLHF, open-ended learning, alignment
TL;DR: We present a new framework for the open-ended self-training of large language models.
Abstract: Current RLHF approaches for aligning large language models (LLMs) typically assume a fixed prompt distribution, which is sub-optimal and limits the generalization capabilities of language models. To address this issue, we introduce a general framework that casts alignment as an asymmetric game between two players: (i) a creator, which strategically generates informative prompt distributions using reward signals, and (ii) a solver, which learns to produce preferred responses on the prompts produced by the creator.
This framework, Evolving Alignment via Asymmetric Self-Play (`eva`), results in a simple and efficient approach that can utilize any existing RLHF algorithm. `eva` achieves a new state of the art on widely adopted alignment benchmarks without the need for any additional human-crafted prompts, e.g., it improves the win rate of finetuned gemma-2-9b-it on Arena-Hard from 51.6% to 60.1% with DPO, from 55.7% to 58.9% with SPPO, from 52.3% to 60.7% with SimPO, and from 54.8% to 60.3% with ORPO, surpassing its 27B version and matching Claude-3-opus. Finally, we show that `eva` is effective and robust under various ablation settings.
We hope `eva` can serve as a scalable and easy-to-use methodology for the research community to build open-ended, robust, and self-improving language agents that align with human values.
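The creator/solver loop described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the informativeness score, the mutation step, and the `preference_optimizer` callable are all simplifying assumptions standing in for the reward-signal-driven prompt evolution and for whichever RLHF algorithm (DPO, SPPO, SimPO, ORPO, ...) the solver uses.

```python
def informativeness(prompt, solver_reward):
    # Assumed proxy for informativeness: prompts on which the solver's
    # reward is low are treated as more useful to train on next.
    return 1.0 - solver_reward(prompt)

def creator_step(prompt_buffer, solver_reward, k=2):
    # Creator: score prompts by informativeness, keep the top-k, and
    # "evolve" them (here a toy mutation that appends a variant marker).
    ranked = sorted(prompt_buffer,
                    key=lambda p: informativeness(p, solver_reward),
                    reverse=True)
    survivors = ranked[:k]
    mutated = [p + " (variant)" for p in survivors]
    return survivors + mutated

def solver_step(prompts, preference_optimizer):
    # Solver: run any off-the-shelf preference-optimization step on the
    # creator's prompts; stubbed out as an injected callable here.
    return preference_optimizer(prompts)
```

A single round then alternates `creator_step` (reshaping the prompt distribution) with `solver_step` (updating the policy on those prompts), which is why any existing RLHF algorithm can plug in as the solver.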
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12746