RLHF without RL

Published: 16 Feb 2024, Last Modified: 28 Mar 2024, BT@ICLR2024, CC BY 4.0
Keywords: RLHF, DPO
Blogpost Url: https://iclr-blogposts.github.io/2024/blog/rlhf-without-rl/
Abstract: Reinforcement learning from human feedback (RLHF) plays an important role in aligning language models with human preferences. However, there has been some debate about whether RLHF is actually reinforcement learning at all: the environment consists of the model itself, and no new data is acquired during training. The only point at which additional data enters the training process is in the supervised training of the reward function. Recently, this debate has been intensified by the publication of the Direct Preference Optimization algorithm, which bypasses reinforcement learning entirely. In this blogpost we discuss related work, highlight the information flow of RLHF, and analyze to what extent alignment requires RL for modern applications of LLMs.
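For context on the abstract's claim that Direct Preference Optimization bypasses reinforcement learning entirely: DPO (the first referenced paper, Rafailov et al.) replaces the reward-model-plus-PPO pipeline with a single supervised objective over preference pairs, where the policy itself implicitly defines the reward. A sketch of the objective, with $y_w$ the preferred and $y_l$ the dispreferred completion for prompt $x$, $\pi_{\theta}$ the trained policy, $\pi_{\mathrm{ref}}$ the frozen reference model, and $\beta$ the KL-regularization strength:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_{\theta}; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Because this is an ordinary maximum-likelihood loss on a fixed preference dataset $\mathcal{D}$, no environment interaction or policy rollouts are needed during training, which is the sense in which DPO "bypasses reinforcement learning."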
Ref Papers: https://arxiv.org/abs/2305.18290, https://arxiv.org/abs/1909.08593
Id Of The Authors Of The Papers: ~Rafael_Rafailov1, ~Daniel_Ziegler1
Conflict Of Interest: None
Submission Number: 17