Keywords: RLHF Theory, LLM Alignment
Abstract: Reinforcement Learning from Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel online algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need to estimate the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is minimized directly over a preference dataset. We provide a theoretical analysis of our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard, showing substantial improvement over state-of-the-art online RLHF algorithms.
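To make the "loss minimized directly over a preference dataset" idea in the abstract concrete, the sketch below shows the general shape of one step of such an iterative self-play loop: the current policy's log-probabilities on preference pairs are compared against those of the previous iterate, and a pairwise loss is averaged over the dataset. This is a minimal illustrative sketch only, assuming NumPy arrays of per-response log-probabilities; the DPO-style log-sigmoid margin used here is a generic stand-in, not the INPO objective defined in the paper, and all variable names are hypothetical.

# Illustrative sketch of an iterative self-play preference-optimization step.
# The loss is a generic DPO-style objective against the previous iterate,
# used only as a stand-in; it is NOT the paper's INPO loss.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def preference_loss(logp_chosen, logp_rejected,
                    prev_logp_chosen, prev_logp_rejected, beta=0.1):
    # Scaled log-ratio margin between the current policy and the previous
    # iterate (playing the role of the reference), averaged over the batch.
    margin = beta * ((logp_chosen - prev_logp_chosen)
                     - (logp_rejected - prev_logp_rejected))
    return -np.log(sigmoid(margin)).mean()

# Toy usage with random log-probabilities standing in for model outputs.
rng = np.random.default_rng(0)
logp_c = rng.normal(-5.0, 1.0, 64)   # current policy, preferred responses
logp_r = rng.normal(-6.0, 1.0, 64)   # current policy, dispreferred responses
prev_c = rng.normal(-5.5, 1.0, 64)   # previous iterate, preferred responses
prev_r = rng.normal(-5.5, 1.0, 64)   # previous iterate, dispreferred responses
print(preference_loss(logp_c, logp_r, prev_c, prev_r))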
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11848