A Simple Reward Composition Method for Effectively Fine-Tuning Large Language Models with Diverse Feedback
Abstract: Reinforcement learning from human feedback has emerged as a promising paradigm that significantly enhances the performance of large language models. Typically, reward models are trained to align with human preferences and are then used to optimize the pretrained language model. However, given the multifaceted nature of human preferences, it is challenging to appropriately combine rewards from different aspects. Recent studies have developed algorithms that address this issue with techniques such as weighting and ranking. Nonetheless, despite their elegant design, these methods can perform poorly in certain scenarios. In this paper, we explore the reward composition problem from a novel perspective. We posit that different reward models focus on distinct optimization directions, which the language model cannot discern, since it perceives only the reward value. To formulate an appropriate reward signal, we introduce a simple yet effective approach: a global reward model that composes rewards from various aspects in a self-supervised manner. This global reward model can be trained without additional supervised data and is compatible with any type of reward model. Experimental results demonstrate the superiority of our method across a range of scenarios with different types of rewards.
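The abstract does not specify the architecture of the global reward model or its self-supervised training objective, so the sketch below is only a hypothetical illustration of the general setup it describes: scores from several aspect-specific reward models are fed into a learned composer module that outputs a single scalar reward for RL fine-tuning. All class and variable names are assumptions, not the authors' implementation.

```python
# Hypothetical sketch only: the paper's actual architecture and
# self-supervised objective are not given in the abstract.
import torch
import torch.nn as nn


class GlobalRewardComposer(nn.Module):
    """Maps per-aspect reward scores to a single scalar reward."""

    def __init__(self, num_aspects: int, hidden_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_aspects, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, aspect_rewards: torch.Tensor) -> torch.Tensor:
        # aspect_rewards: (batch, num_aspects) scores from separate reward models
        return self.net(aspect_rewards).squeeze(-1)


# Usage: compose scores from three hypothetical aspect-specific reward models
composer = GlobalRewardComposer(num_aspects=3)
aspect_scores = torch.tensor([[0.7, -0.2, 1.1]])  # e.g. helpfulness, safety, coherence
global_reward = composer(aspect_scores)           # scalar fed to the RL objective (e.g. PPO)
print(global_reward)
```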
Paper Type: long
Research Area: Machine Learning for NLP
Contribution Types: Theory
Languages Studied: English