When is RL better than DPO in RLHF? A Representation and Optimization Perspective

Published: 19 Mar 2024 (last modified: 06 May 2024) · Tiny Papers @ ICLR 2024 (Notable) · CC BY 4.0
Keywords: Reinforcement Learning from Human Feedback, Direct Preference Optimization, Generalization
TL;DR: We show that RL outperforms DPO in RLHF when the feature representation is well-specified and the online optimization is carried out sufficiently.
Abstract: Aligning large language models with human preferences is important, and there are two main classes of alignment methods. The first class is based on reinforcement learning (RL), which learns a reward function from a human preference dataset and improves performance via online reward maximization. The second class is characterized by direct preference optimization, exemplified by DPO (Rafailov et al., 2023), which learns an implicit reward and improves performance directly on a static offline dataset. Which class of algorithms performs better? We investigate this question using contextual bandits, which serve as mathematical models for alignment. We present two findings: First, we show that DPO may suffer from a reward quality issue when the feature representation is misspecified. Second, we derive error bounds for RL algorithms and show that they achieve the best improvement when the online updates are sufficient. The code to reproduce our results is available at https://github.com/liziniu/policy_optimization.
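To make the contrast between the two classes concrete, below is a minimal sketch of the per-pair DPO objective from Rafailov et al. (2023): the loss is the negative log-sigmoid of the implicit reward margin, where the implicit reward is the log-probability ratio between the policy and a reference model scaled by beta. The function name and the example log-probabilities are illustrative, not from the paper.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss on one preference pair (chosen w, rejected l).

    Implicit rewards are beta * (log pi(y|x) - log pi_ref(y|x));
    the loss is -log sigma(reward_w - reward_l).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigma(margin)

# Illustrative values: the policy has shifted mass toward the chosen response
# relative to the reference, so the margin is positive and the loss drops
# below its starting value of -log sigma(0) = log 2.
loss = dpo_loss(logp_w=-1.0, logp_l=-2.0, ref_logp_w=-1.5, ref_logp_l=-1.5)
```

Note that this objective is evaluated entirely on a static offline dataset; no online reward maximization is involved, which is exactly where the paper's representation and optimization analysis applies.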
Submission Number: 206