Reinforcement Learning with Human Feedback (RLHF) has gained significant attention for aligning AI behavior with human preferences. Self-play style RLHF has shown strong advantages, as highlighted by several recent studies. However, current self-play style RLHF approaches face several limitations, including the lack of provable sample efficiency, the absence of active exploration, and limited diversity in training data. To address these challenges, we propose a novel RLHF framework that balances exploration and exploitation while providing theoretical guarantees. Building on this framework, we introduce Two-Agent Nash Policy Optimization (TANPO), an equivalent and easy-to-implement two-agent algorithm. In TANPO, the two players are trained with different loss functions to ensure more diverse and informative data collection. We also propose Single-Agent Diversity-driven Optimization (SADPO), a single-agent approximation of TANPO, supported by both theoretical analysis and empirical evidence. Our theoretical analysis shows that the proposed framework enjoys sublinear regret under general function approximation and mild structural conditions, with a detailed analysis provided for the linear case. Empirically, we implement TANPO and SADPO with Zephyr-7B-SFT as the base model; both outperform several baselines across multiple evaluation benchmarks, including AlpacaEval 2.0, MT-Bench, and various standard academic benchmarks. Our experiments also show that TANPO continues to improve on AlpacaEval 2.0 over extended training epochs, demonstrating its ability to improve consistently while mitigating overfitting.
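To make the core mechanism concrete, the toy sketch below illustrates the general idea of two players optimized with different loss functions so that the pair collects more diverse preference data. It is not the paper's TANPO algorithm: the discrete response space, the simulated preference oracle, and the helpers `ToyPolicy`, `preference_oracle`, and `diversity_bonus` are all hypothetical illustration choices.

```python
# Illustrative sketch only: a toy two-player preference-optimization loop in
# which each player is trained with a different loss. All names and choices
# here are hypothetical and are not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_RESPONSES = 8  # toy discrete response space


class ToyPolicy(nn.Module):
    """Categorical policy over a small discrete set of candidate responses."""

    def __init__(self, num_responses: int = NUM_RESPONSES):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_responses))

    def log_probs(self) -> torch.Tensor:
        return F.log_softmax(self.logits, dim=-1)


def preference_oracle(a: int, b: int) -> float:
    """Simulated preference feedback: probability that response a beats b."""
    return torch.sigmoid(torch.tensor(float(a - b))).item()


def preference_loss(policy: ToyPolicy, winner: int, loser: int,
                    beta: float = 0.1) -> torch.Tensor:
    """Pairwise logistic loss on a single (winner, loser) comparison."""
    logp = policy.log_probs()
    margin = beta * (logp[winner] - logp[loser])
    return -F.logsigmoid(margin)


def diversity_bonus(policy: ToyPolicy) -> torch.Tensor:
    """Entropy bonus encouraging the exploring player to cover more responses."""
    logp = policy.log_probs()
    return -(logp.exp() * logp).sum()


player_exploit, player_explore = ToyPolicy(), ToyPolicy()
opt = torch.optim.Adam(
    list(player_exploit.parameters()) + list(player_explore.parameters()), lr=1e-2
)

for step in range(200):
    # Each player proposes a response; the oracle labels the preferred one.
    a = torch.distributions.Categorical(logits=player_exploit.logits).sample().item()
    b = torch.distributions.Categorical(logits=player_explore.logits).sample().item()
    winner, loser = (a, b) if torch.rand(1).item() < preference_oracle(a, b) else (b, a)

    # Different losses per player: pure exploitation vs. exploitation plus a
    # diversity (entropy) term that keeps data collection more informative.
    loss_exploit = preference_loss(player_exploit, winner, loser)
    loss_explore = (preference_loss(player_explore, winner, loser)
                    - 0.05 * diversity_bonus(player_explore))

    opt.zero_grad()
    (loss_exploit + loss_explore).backward()
    opt.step()
```

The asymmetry between the two losses is the point of the sketch: one player exploits the current preference signal, while the other trades off some preference fit for broader coverage, so the comparisons they generate together are more diverse than those from a single self-play policy.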