Research Area: Alignment, Safety, Learning algorithms for LMs
Keywords: RLHF, Alignment, Comparative RL, LLM
TL;DR: We propose an algorithm that uses comparative RL to learn human preferences, outperforming PPO and DPO.
Abstract: LLMs may exhibit harmful behavior when not aligned with human values. The dominant approach for steering LLMs toward beneficial behavior is Reinforcement Learning with Human Feedback (RLHF), which involves training a reward model on a human-labeled ranking dataset and then fine-tuning the LLM on the reward signal using RL. Although the reward is learned by comparing different responses, the RL stage involves no direct comparisons. This inconsistency between the reward-learning and reinforcement-learning stages exacerbates RL's instability: for example, the widely adopted RL optimizer, Proximal Policy Optimization (PPO), can perform different gradient updates even on batches carrying identical human preference information. To address this, we propose a new framework, reinforcement learning with comparative feedback, and a simple policy gradient algorithm, Pairwise Proximal Policy Optimization (P3O), which learns to improve from direct comparisons. Theoretically, P3O has the appealing property of being invariant to any reward functions that contain identical preference information, while not requiring a learned value function. Empirical evaluations demonstrate that P3O aligns with human preferences better than existing methods. This suggests that comparative RL is a strong candidate for aligning LLMs with preference data.
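To illustrate the invariance property claimed in the abstract, here is a minimal sketch (not the authors' implementation) of a pairwise policy-gradient update in which the learning signal depends only on the reward difference between two responses to the same prompt; the function name, tensor names, and toy shapes are illustrative assumptions, not details taken from the submission.

```python
import torch

def pairwise_pg_loss(logp_a: torch.Tensor, logp_b: torch.Tensor,
                     r_a: torch.Tensor, r_b: torch.Tensor) -> torch.Tensor:
    # Relative advantage of response a over response b for the same prompt.
    advantage = (r_a - r_b).detach()
    # REINFORCE-style pairwise update: raise the log-probability of the
    # preferred response and lower the other, weighted by the preference gap.
    return -(advantage * (logp_a - logp_b)).mean()

# Toy check: shifting both rewards by the same prompt-level constant (i.e.,
# preserving the preference information) leaves the loss, and hence the
# gradient, unchanged.
logp_a = torch.randn(8, requires_grad=True)
logp_b = torch.randn(8, requires_grad=True)
r_a, r_b = torch.randn(8), torch.randn(8)
assert torch.allclose(pairwise_pg_loss(logp_a, logp_b, r_a, r_b),
                      pairwise_pg_loss(logp_a, logp_b, r_a + 3.0, r_b + 3.0))
```

Because the update uses only the reward gap, any two reward functions encoding the same preference ordering produce identical gradients, which is the consistency between reward learning and RL that the abstract describes.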
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 622