Online Iterative Reinforcement Learning from Human Feedback with General Preference Model

Published: 25 Sept 2024, Last Modified: 06 Nov 2024 · NeurIPS 2024 poster · CC BY 4.0
Keywords: RLHF Theory
TL;DR: We study theoretical Reinforcement Learning from Human Feedback (RLHF) under a general preference oracle.
Abstract: We investigate Reinforcement Learning from Human Feedback (RLHF) in the context of a general preference oracle. In particular, we do not assume, as most prior work does, that there exists a reward function and a preference signal drawn from the Bradley-Terry model. Instead, we consider a standard mathematical formulation: the reverse-KL regularized minimax game between two LLMs for RLHF under a general preference oracle. The learning objective of this formulation is to find a policy that is consistently preferred by the KL-regularized preference oracle over any competing LLM. We show that this framework is strictly more general than the reward-based one, and we propose sample-efficient algorithms both for offline learning from a pre-collected preference dataset and for online learning, where the preference oracle can be queried during training. Empirical studies verify the effectiveness of the proposed framework.
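For context, a minimal sketch of the reverse-KL regularized minimax game described in the abstract, assuming a standard Nash-learning-style formulation with prompt distribution $d_0$, reference policy $\pi_0$, regularization coefficient $\eta$, and preference oracle $\mathbb{P}(a \succ a' \mid x)$; the paper's exact notation may differ:

$$
\max_{\pi}\ \min_{\pi'}\ \mathbb{E}_{x \sim d_0}\Big[\, \mathbb{E}_{a \sim \pi(\cdot\mid x),\, a' \sim \pi'(\cdot\mid x)}\big[\mathbb{P}(a \succ a' \mid x)\big] \;-\; \eta\, \mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_0(\cdot\mid x)\big) \;+\; \eta\, \mathrm{KL}\big(\pi'(\cdot\mid x)\,\|\,\pi_0(\cdot\mid x)\big) \Big].
$$

Under this sketch, the max-player seeks a policy preferred by the oracle over the min-player's best response while both are penalized for drifting from $\pi_0$; the equilibrium policy of this game corresponds to the learning target described in the abstract.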
Supplementary Material: zip
Primary Area: Reinforcement learning
Submission Number: 10332