['1c1', '< Title: ITERATIVE NASH POLICY OPTIMIZATION: ALIGNING LLMS WITH GENERAL PREFERENCES VIA NO-REGRET LEARNING', '---', '> Title: ITERATIVE NASH POLICY OPTIMIZATION (INPO): ALIGNING LLMS WITH GENERAL PREFERENCES', '3c3', '< Abstract: Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel online algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via noregret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art online RLHF algorithms.', '---', "> Abstract: Reinforcement Learning with Human Feedback (RLHF) has significantly advanced the alignment of large language models (LLMs) with human preferences. However, most existing RLHF methods rely on reward models built upon the Bradley-Terry (BT) assumption, which often oversimplifies the true complexity of human preferences. 
This paper introduces a novel game-theoretic framework for RLHF under general preferences, formulating the alignment problem as a two-player game. We propose Iterative Nash Policy Optimization (INPO), a new online algorithm designed to learn the Nash policy. INPO's core innovation lies in its self-play, no-regret learning approach, which approximates the Nash policy without requiring the computationally expensive estimation of individual response win rates. Instead, we introduce a novel loss objective directly minimized over a preference dataset. We provide rigorous theoretical analysis and demonstrate INPO's superior effectiveness through extensive experiments on various benchmarks. Using an LLaMA-3-8B SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard, marking substantial improvements over state-of-the-art online RLHF algorithms.", '6,11c6', '< Large language models (LLMs) such as ChatGPT (Achiam et al., 2023), Claude (Anthropic, 2023), and Bard (Google, 2023) have achieved tremendous success in various instruction-following tasks.', '< A key factor in this success is the technique of reinforcement learning with human feedback (RLHF) (Christiano et al., 2017), which aligns LLMs with human preferences and values. The first standard RLHF framework for LLM alignment was proposed by Ouyang et al. (2022). They first train a reward model (RM) on a dataset containing human preferences. Subsequently, a pretrained LLM is fine-tuned to maximize the reward from this RM using the proximal policy optimization (PPO) algorithm (Schulman et al., 2017). Models trained with this pipeline can generate humanpreferred outputs even with 100x fewer parameters. Nevertheless, fitting a high-quality RM requires a large amount of human-labeled data, and training with PPO is generally less stable (Peng et al., 2023). To bypass the training of the RM, Rafailov et al. 
(2024) propose the direct preference optimization (DPO) algorithm, which directly learns a policy on a human preference dataset. Compared to RLHF with PPO, DPO is more stable and computationally lightweight.', '< However, the approaches mentioned above, which rely on either an explicit or implicit RM, assume that human preferences can be adequately modeled with the Bradley-Terry (BT) model (Bradley & Terry, 1952). We argue that the BT model cannot fully capture the complexity of human preferences. For example, the preference signal in the BT model is transitive, implying that if A is preferred to B and B is preferred to C, A must be preferred to C. This kind of transitive property may not always hold across diverse human groups and contradicts evidence in human decision-making (May, 1954;Tversky, 1969). In addition, experimental results show that the accuracy of BT-based RMs is about 70% (Bai et al., 2022c;Cui et al., 2023), while preference models outperform them by a clear margin (Ye et al., 2024). This motivates us to consider general preferences without the BT model assumption.', '< To achieve this goal, Munos et al. (2023) formulate the LLM alignment problem as a symmetric two-player game. One can show that for any other policy, the Nash policy of the game enjoys at least one half win rate, ignoring the KL regularization terms. Given the general preference oracle, Munos et al. (2023) propose a planning algorithm to solve for the Nash policy. In this paper, we consider the learning problem, where the general preference oracle is unknown to us, and we only assume access to query the oracle. Inspired by the connections between constant-sum games and online learning (Freund & Schapire, 1999), we propose using a no-regret learning algorithm to learn the Nash policy. The key idea originates from the self-play algorithms used in games, where the policy plays against itself to achieve self-improvement. Our contributions are summarized as follows.', '< Contributions. 
In this paper, we study RLHF for LLM alignment from a game-theoretic perspective. We propose a novel online algorithm called Iterative Nash Policy Optimization (INPO), which learns the Nash policy of a two-player game. Our approach is built on the classical no-regret learning algorithm, online mirror descent (OMD). Unlike previous studies that also explore online algorithms for learning the Nash policy (Rosset et al., 2024;Wu et al., 2024), our approach does not require calculation of the expected win rate for each response, which is difficult to estimate accurately and may incur high costs in practice. Instead, we propose a new loss objective and prove that the minimizer of this loss uniquely corresponds to our target policy in each iteration. Therefore, similar to (Rafailov et al., 2024;Azar et al., 2024), our approach directly learns the policy over a preference dataset by minimizing the loss objective.', '< We prove that our algorithm approximates Nash policy with an iteration complexity of O 1 ϵ 2 and achieves last-iterate convergence at a rate of O(1/T ). More importantly, our algorithm is easy to implement in practice, and we conduct experiments on several popular benchmarks to demonstrate its effectiveness. Remarkably, with an SFT model from LLaMA-3-8B, our INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 (Li et al., 2023a) and a 37.8% win rate on Arena-Hard v0.1 (Li et al., 2024), exhibiting at least 27.7% relative improvement over the state-of-the-art online RLHF algorithms (Dong et al., 2024;Wu et al., 2024).', '---', "> Large language models (LLMs) have demonstrated remarkable capabilities in various instruction-following tasks (Achiam et al., 2023; Anthropic, 2023; Google, 2023). A pivotal technique underpinning this success is Reinforcement Learning with Human Feedback (RLHF) (Christiano et al., 2017), which aligns LLMs with complex human preferences and values. 
The seminal RLHF framework (Ouyang et al., 2022) involves training a reward model (RM) on human preference data, followed by fine-tuning a pretrained LLM using Proximal Policy Optimization (PPO) (Schulman et al., 2017) to maximize this reward. While effective, this approach often necessitates extensive human-labeled data for a high-fidelity RM and can suffer from PPO's inherent training instabilities (Peng et al., 2023). To circumvent RM training, Direct Preference Optimization (DPO) (Rafailov et al., 2024) directly optimizes the policy on a human preference dataset, offering improved stability and computational efficiency over PPO-based RLHF.", '12a8,15', '> However, both explicit (RM-based) and implicit (DPO-like) reward-based methods fundamentally assume that human preferences can be accurately captured by the Bradley-Terry (BT) model (Bradley & Terry, 1952). We contend that the BT model, with its inherent transitivity assumption (if A > B and B > C, then A > C), fails to fully account for the nuanced and often intransitive nature of human decision-making, as evidenced in psychological studies (May, 1954; Tversky, 1969). Furthermore, empirical studies reveal that the accuracy of BT-based RMs typically hovers around 70% (Bai et al., 2022c; Cui et al., 2023), whereas more sophisticated preference models consistently achieve superior performance (Ye et al., 2024). These limitations highlight the critical need to move beyond the BT model and embrace a framework that accommodates general preferences.', '> ', '> In pursuit of this, Munos et al. (2023) pioneered a game-theoretic formulation of LLM alignment as a symmetric two-player game. Their work demonstrates that the Nash policy of this game guarantees at least a half win rate against any other policy (disregarding KL regularization terms). While Munos et al. 
(2023) address the planning problem with a known general preference oracle, we tackle the more practical learning problem where the oracle is unknown, and we only have query access. Drawing inspiration from the deep connections between constant-sum games and online learning (Freund & Schapire, 1999), we propose a no-regret learning algorithm to discover the Nash policy. Our core methodology leverages a self-play mechanism, where the policy iteratively refines itself by playing against its current iteration, leading to continuous self-improvement. Our contributions are manifold and are summarized below.', '> ', '> Contributions. This paper presents a comprehensive study of RLHF for LLM alignment from a novel game-theoretic perspective. We introduce Iterative Nash Policy Optimization (INPO), a new online algorithm designed to learn the Nash policy of a two-player game. INPO is grounded in the robust theoretical framework of online mirror descent (OMD), a classical no-regret learning algorithm. A key differentiator of INPO from prior online Nash policy learning algorithms (Rosset et al., 2024; Wu et al., 2024) is its ability to bypass the explicit calculation of expected win rates for individual responses. Such estimations are notoriously difficult to obtain accurately and incur significant computational or annotation costs in practice. Instead, we formulate a novel loss objective and rigorously prove that its unique minimizer at each iteration directly corresponds to our target policy. Consequently, akin to DPO (Rafailov et al., 2024) and IPO (Azar et al., 2024), INPO directly learns the policy by minimizing this loss objective over a preference dataset.', '> ', '> We provide theoretical guarantees for INPO, demonstrating that our algorithm approximates the Nash policy with an iteration complexity of O(1/ϵ^2) and achieves last-iterate convergence at a rate of O(1/T). Beyond theoretical soundness, INPO is notably straightforward to implement in practice. 
We validate its efficacy through extensive experiments on several prominent benchmarks. Impressively, starting from an LLaMA-3-8B SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 (Li et al., 2023a) and a 37.8% win rate on Arena-Hard v0.1 (Li et al., 2024). These results represent a substantial relative improvement of at least 27.7% over existing state-of-the-art online RLHF algorithms (Dong et al., 2024; Wu et al., 2024).', '> ', '24,27c27,30', '< Section: RLHF WITH BT MODEL ASSUMPTION', '< Bradley-Terry (BT) Model Assumption. Instead of directly considering the general preference, the prevalent RLHF framework makes the Bradley-Terry (BT) model assumption. It assumes that there exists a reward function R * such that for any x ∈ X and y 1 , y 2 ∈ Y:', '< P(y 1 ≻ y 2 | x) = exp(R * (x, y 1 )) exp(R * (x, y 1 )) + exp(R * (x, y 2 )) = σ R * (x, y 1 ) -R * (x, y 2 ) .', '< After learning a reward function R, previous RLHF algorithms aim to maximize the following KL-regularized objective:', '---', '> Section: RLHF WITH BRADLEY-TERRY MODEL ASSUMPTION', '> Bradley-Terry (BT) Model Assumption. The prevailing RLHF framework, rather than directly addressing general preferences, typically adopts the Bradley-Terry (BT) model assumption. This assumption posits the existence of a latent reward function R * such that for any prompt x ∈ X and a pair of responses y 1 , y 2 ∈ Y, the probability of preferring y 1 over y 2 is given by:', '> P(y 1 ≻ y 2 | x) = exp(R * (x, y 1 )) / (exp(R * (x, y 1 )) + exp(R * (x, y 2 ))) = σ(R * (x, y 1 ) -R * (x, y 2 )) .', '> Following the learning of such a reward function R, traditional RLHF algorithms aim to maximize a KL-regularized objective:', '29,35c32', '< Here π ref is the reference policy, which is usually a supervised fine-tuned LLM, and τ > 0 is the regularization parameter. 
By maximizing the objective, the obtained policy simultaneously achieves a high reward and stays close to π ref , which can mitigate reward hacking (Tien et al., 2022;Skalse et al., 2022) to some extent.', '< Direct Preference Optimization (DPO). Rafailov et al. (2024) propose the direct preference optimization (DPO) algorithm, which directly optimizes a policy and bypasses the need to learn a reward function. The key idea is that there is a closed-form solution to Eq. ( 2):', '< π * (y|x) ∝ π ref (y|x) exp 1 τ R(x, y) ,', '< which shows that each policy π implicitly parameterizes a reward function. We can directly formulate a maximum likelihood objective to learn the optimal policy:', '< -E x,yw,y l ∼D log σ τ log π(y w |x)', '< π ref (y w |x) -τ log π(y l |x) π ref (y l |x) ,', '< where D represents a preference dataset, σ(z) = 1/(1 + exp(-z)) is the sigmoid function, (y w , y l ) is a preference pair for the prompt x, with y w being the preferred response.', '---', '> Here, π ref denotes the reference policy, commonly a supervised fine-tuned LLM, and τ > 0 is a hyperparameter controlling the regularization strength. Maximizing this objective encourages the learned policy to generate high-reward outputs while remaining close to π ref , thereby mitigating issues like reward hacking (Tien et al., 2022; Skalse et al., 2022).', '36a34,39', '> Direct Preference Optimization (DPO). To circumvent the explicit training of a reward function, Rafailov et al. (2024) introduced the Direct Preference Optimization (DPO) algorithm. DPO directly optimizes the policy by leveraging the insight that a closed-form solution exists for Eq. (2):', '> π * (y|x) ∝ π ref (y|x) exp((1/τ) R(x, y)) ,', '> This relationship implies that every policy π implicitly parameterizes a reward function. 
DPO then formulates a maximum likelihood objective to directly learn the optimal policy:', '> -E x,(y w ,y l )∼D [log σ(τ log(π(y w |x)/π ref (y w |x)) - τ log(π(y l |x)/π ref (y l |x)))] ,', '> where D represents a dataset of human preferences, σ(z) = 1/(1 + exp(-z)) is the sigmoid function, and (y w , y l ) denotes a preference pair for prompt x, with y w being the preferred response.', '> ', '398d400', '< ']
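
The two formulas on the revised side of the diff, the BT preference probability and the DPO objective, can be sketched for a single preference pair as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`bt_preference`, `dpo_pair_loss`) and argument names are hypothetical, and sequence log-probabilities are passed in as plain floats rather than computed from a model.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + exp(-z)), computed stably."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bt_preference(r1: float, r2: float) -> float:
    """BT probability P(y1 > y2 | x) = sigma(R(x, y1) - R(x, y2)).

    r1 and r2 stand for the rewards R(x, y1) and R(x, y2).
    """
    return sigmoid(r1 - r2)


def dpo_pair_loss(logp_w: float, logp_l: float,
                  ref_logp_w: float, ref_logp_l: float,
                  tau: float) -> float:
    """Per-pair DPO loss:
    -log sigma(tau * log(pi(y_w|x)/pi_ref(y_w|x))
               - tau * log(pi(y_l|x)/pi_ref(y_l|x))).

    Inputs are log-probabilities of the preferred (w) and rejected (l)
    responses under the policy and the reference policy (assumed given).
    """
    margin = tau * (logp_w - ref_logp_w) - tau * (logp_l - ref_logp_l)
    # -log(sigmoid(m)) equals softplus(-m); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Equal rewards give an even preference under the BT model.
    print(bt_preference(1.0, 1.0))
    # A policy identical to the reference has zero margin, so the loss is log 2.
    print(dpo_pair_loss(-5.0, -6.0, -5.0, -6.0, tau=0.1))
```

Averaging `dpo_pair_loss` over the pairs in D gives the empirical version of the objective; raising the policy's log-probability of the preferred response relative to the reference lowers the loss.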
