Evolving Alignment via Asymmetric Self-Play

Published: 30 Oct 2024, Last Modified: 13 Dec 2024 · LanGame Spotlight · CC BY 4.0
Keywords: self-play, large language models, RLHF, preference fine-tuning, open-ended learning, alignment
TL;DR: We present a new framework for open-ended self-training large language models.
Abstract: Current RLHF approaches for aligning large language models (LLMs) typically assume a fixed prompt distribution, which is sub-optimal and limits the generalization capabilities of language models. To address this issue, we introduce a general framework that casts alignment as an asymmetric game between two players: (i) a creator, which strategically generates informative prompt distributions using reward signals, and (ii) a solver, which learns to produce preferred responses on the prompts produced by the creator. This framework of Evolving Alignment via Asymmetric Self-Play (`eva`) results in a simple and efficient approach that can utilize any existing RLHF algorithm. `eva` achieves a new state of the art on widely adopted alignment benchmarks without the need for any additional human-crafted prompts; e.g., it improves the win rate of fine-tuned gemma-2-9b-it on Arena-Hard from 51.6% to 60.1% with DPO, from 55.7% to 58.9% with SPPO, from 52.3% to 60.7% with SimPO, and from 54.8% to 60.3% with ORPO, surpassing its 27B version and matching Claude-3-opus. Finally, we show `eva` is effective and robust under various ablation settings. We hope `eva` can serve as a scalable methodology for the research community to build open-ended, robust, and self-improving language agents that align with human values.
Submission Number: 23
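
To make the creator/solver loop described in the abstract concrete, here is a minimal sketch of one `eva`-style round. It is illustrative only: the function names, the stub reward model and solver, the advantage-style informativeness proxy, and the prompt-mutation step are assumptions for exposition, not the authors' implementation; in practice the solver would be preference-tuned with an existing RLHF algorithm (DPO, SPPO, SimPO, ORPO) on the evolved prompt pool.

```python
# Minimal sketch of an eva-style creator/solver round (illustrative assumptions throughout).
import random


def reward(prompt: str, response: str) -> float:
    """Stub reward model; a real setup would query a learned reward model."""
    return random.random()


def solver_generate(prompt: str, n: int = 4) -> list[str]:
    """Stub solver; a real setup would sample n responses from the LLM being aligned."""
    return [f"response-{i} to: {prompt}" for i in range(n)]


def informativeness(prompt: str) -> float:
    """Proxy for how much a prompt can still teach the solver:
    the gap between the best and the average sampled reward."""
    rewards = [reward(prompt, r) for r in solver_generate(prompt)]
    return max(rewards) - sum(rewards) / len(rewards)


def mutate(prompt: str) -> str:
    """Stub prompt evolution; a real creator would rewrite the prompt with an LLM
    (e.g., add constraints, deepen, or broaden it)."""
    return prompt + " (with an added constraint)"


def eva_round(prompt_pool: list[str], keep: int = 2, children: int = 2) -> list[str]:
    """One creator step: score prompts by informativeness, keep the top ones,
    and evolve new variants from them to form the next training pool.
    The solver is then preference-tuned on this pool with any RLHF algorithm."""
    parents = sorted(prompt_pool, key=informativeness, reverse=True)[:keep]
    offspring = [mutate(p) for p in parents for _ in range(children)]
    return parents + offspring


if __name__ == "__main__":
    pool = ["Explain RLHF.", "Write a sorting function.", "Summarize this paper."]
    for step in range(3):
        pool = eva_round(pool)
        print(f"round {step}: pool of {len(pool)} prompts")
```

The key design point this sketch tries to convey is the asymmetry: the creator only reshapes the prompt distribution using reward signals, while the solver only optimizes responses, so the loop plugs into any existing preference-optimization pipeline without new human-written prompts.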
