A Long Way To Go: Investigating Length Correlations in RLHF

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Natural Language Processing, Large Language Models, RLHF, Reward Hacking
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Many of the gains of RLHF with open reward models can be attributed to making outputs longer.
Abstract: Great successes have been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models. As the growth of open-source preference datasets, reward models, and language models has enabled wider experimentation, RLHF's benefits have been demonstrated in settings beyond general chat agents, including web question answering, summarization, and multi-turn dialogue. However, RLHF has also been consistently observed to drive models to produce longer outputs. This paper demonstrates that optimizing for response length is a significant factor behind RLHF's reported improvements in helpfulness. First, we study the relationship between reward and length for reward models trained on three open-source preference datasets. In these settings, length correlates strongly with reward, and improvements in reward score are driven in large part by shifting the distribution over output lengths. We then explore interventions during both RL and reward model learning to see if we can achieve the same downstream improvements as RLHF without increasing length. While our interventions mitigate length increases, they are not uniformly effective across settings. Furthermore, we find that even running RLHF with a reward based solely on length can reproduce most of the downstream improvements over the initial policy model, showing that reward models in these settings have a long way to go.
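The reward–length analysis described in the abstract can be illustrated with a minimal sketch: score (prompt, response) pairs with a reward model and compute the Pearson correlation between response token length and reward. The checkpoint name, data, and helper functions below are hypothetical placeholders for illustration, not the paper's actual setup.

```python
# Minimal sketch: correlation between reward-model score and response length.
# The model name and (prompt, response) data are placeholders.
import torch
from scipy.stats import pearsonr
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "your-org/your-reward-model"  # hypothetical reward-model checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
reward_model.eval()

def reward_score(prompt: str, response: str) -> float:
    """Score one (prompt, response) pair with the reward model."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0].item()

def length_reward_correlation(pairs: list[tuple[str, str]]) -> float:
    """Pearson correlation between response token length and reward score."""
    lengths = [len(tokenizer(response).input_ids) for _, response in pairs]
    rewards = [reward_score(prompt, response) for prompt, response in pairs]
    return pearsonr(lengths, rewards)[0]
```

A strong positive correlation here would be consistent with the paper's claim that reward gains can be achieved largely by producing longer outputs; the length-only reward experiment mentioned in the abstract amounts to replacing `reward_score` with the response's token count.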
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8099