Research Area: Alignment, Science of LMs, Learning algorithms for LMs, LMs and interactions
Keywords: RLHF, Alignment, Direct Preference Optimization
TL;DR: We prove that direct alignment algorithms, such as DPO, can be interpreted in a Q-learning framework. This yields novel insights, empirical results, and new potential applications.
Abstract: Reinforcement Learning From Human Feedback (RLHF) has been a critical component of the success of the latest generation of generative AI models, including the GPT series. However, it is an involved and complex process, and direct alignment algorithms, such as DPO, have recently emerged as an alternative to the classical RLHF pipeline. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches: standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the model's entire response is treated as a single arm. In this work we rectify this difference. We first show theoretically that DPO can be derived in the token-level MDP as a general inverse Q-learning algorithm that satisfies the Bellman equation. Using our theoretical results, we provide three concrete empirical insights. First, we show that, because of its token-level interpretation, DPO is able to perform a form of credit assignment. Next, we prove that under the token-level formulation, classical search-based algorithms, such as MCTS, which have recently been applied to the language generation space, are equivalent to likelihood-based search on a DPO policy, and we empirically show that a simple beam search yields meaningful improvement over the base DPO policy. Finally, we show how the choice of SFT policy causes implicit rewards to decline during training. We conclude by discussing applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications, and end-to-end training of multi-model systems.
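As a minimal sketch of the objects the abstract refers to (notation assumed here, not taken from the submission): DPO optimizes the sequence-level objective

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right],

where x is a prompt, y_w and y_l are the preferred and dispreferred responses, \sigma is the logistic function, and \beta is the KL-regularization coefficient. Because the policy factorizes autoregressively over tokens, the sequence-level log-ratio decomposes as an identity into per-token terms,

\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;=\; \sum_{t} \beta \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)},

with s_t the prompt plus the tokens generated so far and a_t the next token; the token-level reading discussed in the abstract treats each summand as a per-token implicit reward (advantage) in the token-level MDP.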
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1021