Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment
Keywords: Large language models, Fine-tuning, Alignment, Reinforcement learning
TL;DR: Reward learning improves LLM alignment even when only a demonstration dataset is available.
Abstract: Aligning with human preferences and values is important for contemporary foundation models. State-of-the-art techniques such as Reinforcement Learning from Human Feedback (RLHF) consist of two stages: 1) supervised fine-tuning (SFT), where the model is fine-tuned to imitate human demonstration data; 2) preference learning, where preference data is used to learn a reward model, which is then used by a reinforcement learning (RL) step to fine-tune the model. In this work, we argue that the SFT stage benefits from learning a reward model as well. Instead of using the human demonstration data directly via supervised learning, we propose to leverage an Inverse RL (IRL) technique to build a reward model while learning the policy model. This approach leads to new SFT algorithms that are not only efficient to implement, but also promote the ability to distinguish between preferred and non-preferred continuations. Our results indicate that it is beneficial to explicitly or implicitly leverage reward learning throughout the entire alignment process.
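For illustration only, here is a common maximum-likelihood IRL formulation of reward learning from demonstrations; it is a sketch of the general idea, not necessarily the exact objective used in this submission. The notation is assumed: $\mathcal{D}$ is the demonstration dataset of prompt-response pairs $(x, y)$, $\pi_0$ is the pre-trained reference model, $r$ is the learned reward model, and $\beta$ is a regularization coefficient.

\[
\max_{r}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\log \pi_r(y \mid x)\big],
\qquad
\pi_r(y \mid x) \;\propto\; \pi_0(y \mid x)\,\exp\!\big(r(x,y)/\beta\big).
\]

Under this view, plain SFT corresponds to fitting the policy directly by maximum likelihood on $\mathcal{D}$, whereas the IRL formulation additionally yields a reward $r$ that can score preferred versus non-preferred continuations, which is the ability highlighted in the abstract.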
Submission Number: 26