Removing Length Bias in RLHF is not Enough

14 May 2024 (modified: 06 Nov 2024) · Submitted to NeurIPS 2024 · Everyone · Revisions · BibTeX · CC BY-NC 4.0
Keywords: LLM, RLHF, Prompt Bias
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become an essential technique for enhancing pretrained large language models (LLMs) so that they generate responses aligned with human preferences and societal values. While RLHF has shown promise, the training of reward models (RMs) still faces the challenge of \emph{reward hacking}, motivating recent work to prevent RMs from finding shortcuts that bypass the intended optimization objectives by exploiting simplistic patterns, especially response length. Beyond the issue of \emph{length bias}, our work is the first to reveal that \emph{prompt-template bias} learned by RMs can also cause \emph{reward hacking} on marginal samples, leading LLMs after RLHF fine-tuning to prefer generating responses in a specific format regardless of the format requested in the prompt. To this end, we propose a low-cost yet effective method, Prompt Bias Calibration (PBC), which estimates the \emph{prompt-template bias} term during reward modeling and uses it to calibrate reward scores in the subsequent RL fine-tuning process. We further show that PBC can be flexibly combined with existing algorithms for removing \emph{length bias}, yielding an additional improvement in the quality of generated responses. Experimental results show that our PBC method and its extensions significantly surpass the original implementation of RLHF.
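The abstract describes calibrating reward-model scores by subtracting an estimated prompt-template bias term before RL fine-tuning. The following is a minimal, hypothetical sketch of that idea; the function names, the per-template mean-centering estimator, and the `(template_id, reward_score)` data layout are assumptions for illustration, not the paper's actual PBC algorithm.

```python
# Hypothetical sketch of prompt-template bias calibration: estimate a bias
# term b(t) per prompt template from reward-model scores, then subtract it
# from raw rewards during RL fine-tuning. The estimator used here (mean
# reward per template, centered on the global mean) is an assumption.
from collections import defaultdict


def estimate_template_bias(samples):
    """samples: iterable of (template_id, reward_score) pairs.

    Returns a dict mapping each template id to its mean reward minus the
    global mean reward, used as a crude per-template bias estimate."""
    per_template = defaultdict(list)
    for template_id, score in samples:
        per_template[template_id].append(score)
    all_scores = [s for scores in per_template.values() for s in scores]
    global_mean = sum(all_scores) / len(all_scores)
    return {t: sum(s) / len(s) - global_mean for t, s in per_template.items()}


def calibrated_reward(raw_reward, template_id, bias):
    """Subtract the estimated prompt-template bias from a raw RM score."""
    return raw_reward - bias.get(template_id, 0.0)
```

Under this sketch, a template that the RM systematically over-rewards (e.g. bulleted-list responses) receives a positive bias estimate, so its responses are penalized back toward the global mean during RL fine-tuning.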
Supplementary Material: zip
Primary Area: Natural language processing
Submission Number: 8296