Reinforcement Learning with Human Feedback (RLHF) has gained significant attention for aligning AI behavior with human preferences. Self-play style RLHF has shown strong advantages, as highlighted by several recent studies. However, current self-play style RLHF approaches face several limitations, including the lack of provable sample efficiency, the absence of active exploration, and limited diversity in training data. To address these challenges, we propose a novel RLHF framework that balances exploration and exploitation while providing theoretical guarantees. Building on this framework, we introduce Two-Agent Nash Policy Optimization (TANPO), an equivalent and easy-to-implement two-agent algorithm. In TANPO, the two players are trained with different loss functions to ensure more diverse and informative data collection. We also propose Single-Agent Diversity-driven Optimization (SADPO), a single-agent approximation of TANPO, supported by both theoretical analysis and empirical evidence. Our theoretical analysis shows that the proposed framework enjoys sublinear regret under general function approximation and mild structural conditions, with a detailed analysis provided for the linear case. Empirically, we implement TANPO and SADPO with Zephyr-7B-SFT as the base model; both outperform several baselines across multiple evaluation benchmarks, including AlpacaEval 2.0, MT-Bench, and various standard academic benchmarks. Our experiments also show that TANPO continues to improve on AlpacaEval 2.0 over extended training epochs, demonstrating its ability to improve consistently while mitigating overfitting.
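To make the core mechanism concrete, the toy sketch below illustrates the general idea of two players optimized with different loss functions so that the pair collects more diverse preference data. It is not the paper's TANPO algorithm: the discrete response space, the simulated preference oracle, and the helpers `ToyPolicy`, `preference_oracle`, and `diversity_bonus` are all hypothetical illustration choices.

```python
# Illustrative sketch only: a toy two-player preference-optimization loop in
# which each player is trained with a different loss. All names and choices
# here are hypothetical and are not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_RESPONSES = 8  # toy discrete response space


class ToyPolicy(nn.Module):
    """Categorical policy over a small discrete set of candidate responses."""

    def __init__(self, num_responses: int = NUM_RESPONSES):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_responses))

    def log_probs(self) -> torch.Tensor:
        return F.log_softmax(self.logits, dim=-1)


def preference_oracle(a: int, b: int) -> float:
    """Simulated preference feedback: probability that response a beats b."""
    return torch.sigmoid(torch.tensor(float(a - b))).item()


def preference_loss(policy: ToyPolicy, winner: int, loser: int,
                    beta: float = 0.1) -> torch.Tensor:
    """Pairwise logistic loss on a single (winner, loser) comparison."""
    logp = policy.log_probs()
    margin = beta * (logp[winner] - logp[loser])
    return -F.logsigmoid(margin)


def diversity_bonus(policy: ToyPolicy) -> torch.Tensor:
    """Entropy bonus encouraging the exploring player to cover more responses."""
    logp = policy.log_probs()
    return -(logp.exp() * logp).sum()


player_exploit, player_explore = ToyPolicy(), ToyPolicy()
opt = torch.optim.Adam(
    list(player_exploit.parameters()) + list(player_explore.parameters()), lr=1e-2
)

for step in range(200):
    # Each player proposes a response; the oracle labels the preferred one.
    a = torch.distributions.Categorical(logits=player_exploit.logits).sample().item()
    b = torch.distributions.Categorical(logits=player_explore.logits).sample().item()
    winner, loser = (a, b) if torch.rand(1).item() < preference_oracle(a, b) else (b, a)

    # Different losses per player: pure exploitation vs. exploitation plus a
    # diversity (entropy) term that keeps data collection more informative.
    loss_exploit = preference_loss(player_exploit, winner, loser)
    loss_explore = (preference_loss(player_explore, winner, loser)
                    - 0.05 * diversity_bonus(player_explore))

    opt.zero_grad()
    (loss_exploit + loss_explore).backward()
    opt.step()
```

The asymmetry between the two losses is the point of the sketch: one player exploits the current preference signal, while the other trades off some preference fit for broader coverage, so the comparisons they generate together are more diverse than those from a single self-play policy.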