P3O: Pessimistic Preference-based Policy Optimization for Robust Alignment from Preferences

Published: 10 Oct 2024, Last Modified: 20 Nov 2024
Venue: AFM 2024 Poster
License: CC BY 4.0
Keywords: RLHF, Preference-based Reinforcement Learning
TL;DR: We develop a pessimistic method for learning from human preferences with sound theoretical guarantees, and show resilience to overoptimization in document summarization.
Abstract: We study reinforcement learning (RL) settings where the agent only has access to preferences on the relative quality of pairs of trajectories, provided as a fixed \emph{offline preference dataset} in which pairs of trajectories collected under some base policy are labeled with preference feedback. A reward or pairwise preference function trained on this offline dataset is then used to provide feedback during RL training, and there is a substantial body of work on RL methods for this setting. However, the bulk of the literature ignores the uncertainty of the learned preference function, which leads to reward hacking or overoptimization. In this work, we formulate theoretically sound objectives for preference-based RL that are provably robust to overoptimization through the use of pessimism in the face of uncertainty, and we design practical algorithms to optimize these objectives. We evaluate our algorithms on the task of fine-tuning language models from human feedback and show remarkable resilience to overoptimization.
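The abstract does not reproduce the P3O objective itself, so the sketch below is only a generic illustration of "pessimism in the face of uncertainty" over a learned reward: an ensemble of reward heads stands in for the preference model, and the policy would be optimized against a lower-confidence-bound reward (ensemble mean minus a multiple of the ensemble disagreement). The feature dimension, ensemble size, and pessimism coefficient are illustrative assumptions, not values or architecture from the paper.

```python
import torch
import torch.nn as nn

# Illustrative constants (assumptions, not taken from the paper).
FEATURE_DIM = 16     # hypothetical dimension of (prompt, response) features
ENSEMBLE_SIZE = 4    # number of independently trained reward heads
BETA = 1.0           # pessimism coefficient weighting the uncertainty penalty


class RewardHead(nn.Module):
    """A small reward model mapping trajectory features to a scalar reward."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.Tanh(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


# Each head would be trained on the offline preference dataset
# (e.g. with a Bradley-Terry pairwise loss) from a different random seed.
ensemble = [RewardHead(FEATURE_DIM) for _ in range(ENSEMBLE_SIZE)]


def pessimistic_reward(features: torch.Tensor) -> torch.Tensor:
    """Lower-confidence-bound reward: ensemble mean minus BETA * ensemble std.

    Disagreement across the ensemble serves as a proxy for the uncertainty of
    the learned preference/reward function, so the policy is penalized for
    drifting into regions the offline preference data does not cover.
    """
    preds = torch.stack([head(features) for head in ensemble])  # (E, batch)
    return preds.mean(dim=0) - BETA * preds.std(dim=0)


if __name__ == "__main__":
    batch = torch.randn(8, FEATURE_DIM)   # stand-in for trajectory features
    print(pessimistic_reward(batch))      # rewards fed to the RL fine-tuning step
```

In this kind of scheme, larger BETA makes the objective more conservative: responses that the reward heads disagree about receive a lower effective reward, which is one common way to dampen overoptimization of an imperfectly learned reward.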
Submission Number: 112