DPO Meets PPO: Reinforced Token Optimization for RLHF

Han Zhong; Zikang Shan; Guhao Feng; Wei Xiong; Xinle Cheng; Li Zhao; Di He; Jiang Bian; Liwei Wang

Published: 01 May 2025, Last Modified: 24 Jul 2025, ICML 2025 spotlight poster, CC BY 4.0

TL;DR: We formulate RLHF as token-wise MDPs instead of sentence-level bandits, and propose a provable and practical algorithm, RTO, under this framework.

Abstract: In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards, a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in aligning state-of-the-art closed-source large language models (LLMs), its open-source implementations remain largely sub-optimal, as numerous studies have reported. To address these issues, we introduce a framework that models RLHF as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. We further provide theoretical insights that demonstrate the superiority of this MDP framework over the previous sentence-level bandit formulation. Under this framework, we introduce an algorithm, dubbed Reinforced Token Optimization (RTO), which learns a token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, RTO is proven capable of finding a near-optimal policy sample-efficiently. In practice, RTO integrates Direct Preference Optimization (DPO) and PPO: DPO, although originally derived from sparse sentence-level rewards, provides a token-wise characterization of response quality, which is incorporated into the subsequent PPO training stage. We conduct extensive experiments to evaluate RTO against PPO and other direct preference learning algorithms. The results highlight the effectiveness of RTO, which outperforms PPO by 7.5 points on the AlpacaEval 2 benchmark and by 4.1 points on Arena-Hard. Our code and models are available at https://github.com/zkshan2002/RTO.
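
To make the contrast between sentence-level and token-wise rewards concrete, the following is a minimal sketch, not the authors' implementation. It assumes the token-wise signal is the per-token log-probability ratio between a DPO-trained policy and the reference policy, scaled by a coefficient `beta`; the abstract does not state the exact formula. The function names, the value of `beta`, and the toy log-probabilities are all hypothetical. The repository linked above gives the reward shaping actually used.

```python
from typing import List


def dpo_token_rewards(
    logp_dpo: List[float], logp_ref: List[float], beta: float = 0.1
) -> List[float]:
    """Dense reward per token: beta * (log pi_dpo - log pi_ref).

    Assumption: this log-ratio form is an illustrative reading of the
    "token-wise characterization" mentioned in the abstract.
    """
    if len(logp_dpo) != len(logp_ref):
        raise ValueError("log-probability sequences must have equal length")
    return [beta * (p - q) for p, q in zip(logp_dpo, logp_ref)]


def sentence_level_rewards(total: float, length: int) -> List[float]:
    """Sparse bandit-style reward: zero everywhere except the final token."""
    if length <= 0:
        raise ValueError("length must be positive")
    return [0.0] * (length - 1) + [total]


# Toy per-token log-probabilities for a three-token response (made-up values).
logp_dpo = [-1.0, -0.5, -2.0]
logp_ref = [-1.5, -0.5, -1.0]

dense = dpo_token_rewards(logp_dpo, logp_ref, beta=0.5)
sparse = sentence_level_rewards(sum(dense), len(dense))

print("token-wise:    ", dense)
print("sentence-level:", sparse)
```

Both reward sequences sum to the same total, but the dense one tells the policy optimizer which tokens raised or lowered that total, while the sparse one reveals only the final outcome.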

Link To Code: https://github.com/zkshan2002/RTO

Primary Area: Deep Learning->Large Language Models

Keywords: RLHF, PPO, sample efficiency

Submission Number: 6229
