Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment

Published: 10 Oct 2024 · Last Modified: 01 Nov 2024 · FITML 2024 Poster · CC BY 4.0
Keywords: Alignment
Abstract: Aligning to human preferences and/or intentions is an important requirement for contemporary foundation models. To ensure alignment, popular approaches such as reinforcement learning with human feedback (RLHF) break the task into three stages: (i) a model is trained with supervised fine-tuning (SFT) on a large demonstration dataset, (ii) a reward model (RM) is estimated from human feedback data, and (iii) reinforcement learning (RL) is used to further refine the SFT model by optimizing the estimated reward model. Demonstrations and human feedback data reflect human preferences in different ways. Consequently, a reward model estimated from human feedback data alone is likely less accurate than one estimated from both demonstration and human feedback data, and a policy that optimizes the latter is likely to exhibit better alignment performance. In this paper, we introduce a new alignment paradigm, Alignment with Integrated Human Feedback (AIHF), in which the reward and policy models are trained jointly, in a single stage, by simultaneously leveraging demonstration and human feedback data. We introduce a tractable algorithm for finding the AIHF reward and policy models and provide a finite-time performance guarantee. We further demonstrate the efficiency of the proposed solution with extensive experiments, including alignment problems for LLMs and robotic control problems in MuJoCo. We observe that the proposed solution outperforms existing alignment algorithms such as RLHF and DPO by large margins, especially when the data are unbalanced.
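
To make the single-stage idea concrete, below is a minimal sketch, not the authors' exact AIHF objective, of jointly training a reward model and a policy from both demonstrations and pairwise preferences in a toy discrete contextual-bandit setting. The module names (`RewardNet`, `PolicyNet`), the loss weights, and the specific demonstration term (an IRL-style comparison of expert actions against policy samples) are illustrative assumptions, chosen only to show how the two data sources can enter one joint loss.

```python
# Sketch: joint reward + policy training from demonstrations and preferences.
# All names, architectures, and weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS = 8, 4

class RewardNet(nn.Module):
    """Scores every action for a given state: R(s, .)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS))
    def forward(self, s):
        return self.net(s)

class PolicyNet(nn.Module):
    """Categorical policy pi(a | s), returned as log-probabilities."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS))
    def forward(self, s):
        return F.log_softmax(self.net(s), dim=-1)

reward_model, policy = RewardNet(), PolicyNet()
opt = torch.optim.Adam(list(reward_model.parameters()) +
                       list(policy.parameters()), lr=1e-3)

# Synthetic stand-ins for the two feedback sources.
demo_s = torch.randn(32, STATE_DIM)            # demonstration states
demo_a = torch.randint(0, N_ACTIONS, (32,))    # expert actions
pref_s = torch.randn(32, STATE_DIM)            # preference states
pref_win = torch.randint(0, N_ACTIONS, (32,))  # preferred actions
pref_lose = torch.randint(0, N_ACTIONS, (32,)) # dispreferred actions

lam_demo, lam_pref, ent_coef = 1.0, 1.0, 0.01  # illustrative weights

for step in range(200):
    # (i) Preference term: Bradley-Terry loss on the reward model.
    r_all = reward_model(pref_s)
    r_w = r_all.gather(1, pref_win.unsqueeze(1)).squeeze(1)
    r_l = r_all.gather(1, pref_lose.unsqueeze(1)).squeeze(1)
    loss_pref = -F.logsigmoid(r_w - r_l).mean()

    # (ii) Demonstration term (IRL-style): expert actions should score
    # higher under R than actions drawn from the current policy.
    with torch.no_grad():
        sampled_a = torch.distributions.Categorical(logits=policy(demo_s)).sample()
    r_demo_all = reward_model(demo_s)
    r_expert = r_demo_all.gather(1, demo_a.unsqueeze(1)).squeeze(1)
    r_policy = r_demo_all.gather(1, sampled_a.unsqueeze(1)).squeeze(1)
    loss_demo = -F.logsigmoid(r_expert - r_policy).mean()

    # (iii) Policy term: maximize expected learned reward with an entropy
    # bonus (a soft policy-improvement step against the current reward).
    log_pi = policy(demo_s)
    probs = log_pi.exp()
    expected_r = (probs * r_demo_all.detach()).sum(dim=1).mean()
    entropy = -(probs * log_pi).sum(dim=1).mean()
    loss_policy = -(expected_r + ent_coef * entropy)

    # Single-stage joint update of both models.
    loss = lam_pref * loss_pref + lam_demo * loss_demo + loss_policy
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the sketch is structural: both feedback sources shape the same reward estimate in every update, and the policy is improved against that shared estimate in the same step, rather than in separate SFT, RM, and RL stages.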
Submission Number: 104