Keywords: RLHF, PPO, LLM, reinforcement learning, alignment
TL;DR: A new method that mitigates reward hacking in RLHF via a novel reformulation of the reward terms in the RLHF objective.
Abstract: Reinforcement learning from human feedback (RLHF) is a popular technique for aligning large language models (LLMs) with human preferences. It requires learning a reward model that predicts a scalar value for a generated text sequence, acting as a proxy for human preference scores. A central problem of RLHF is \textit{reward hacking}, i.e., overoptimization: LLMs can easily exploit the reward model by generating text that receives high scores but no longer aligns with human preferences. We address this problem by proposing a new objective that adapts the tradeoff between the reward model score and regularization based on reward uncertainty. We hypothesize that when the reward model uncertainty is low, RLHF should take larger optimization steps by lowering the regularization coefficient; conversely, when the uncertainty is high, optimization should slow down by staying closer to the original model. We present a novel reformulation of the RLHF objective and derive our approach from its generalization to account for reward model variance. We demonstrate that our uncertainty-aware RLHF objective mitigates overoptimization and outperforms vanilla RLHF by 50% on a standard summarization task.
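For intuition, a minimal sketch of the standard KL-regularized RLHF objective, together with one illustrative way an uncertainty-dependent regularization coefficient could enter it. The specific form of $\beta(\sigma_r)$ below, with base coefficient $\beta_0$, scaling factor $\lambda$, and per-sample reward standard deviation $\sigma_r$, is an assumption for illustration and not the paper's exact reformulation:
\[
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) \right] \;-\; \beta \, \mathrm{D}_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right),
\qquad
\beta \;\to\; \beta(\sigma_r) = \beta_0 \left( 1 + \lambda\, \sigma_r(x, y) \right).
\]
Under this illustrative scaling, low reward-model uncertainty (small $\sigma_r$) keeps the effective coefficient near $\beta_0$ and permits larger policy updates, while high uncertainty inflates the KL penalty and keeps the policy closer to the reference model, matching the hypothesis stated in the abstract.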
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Supplementary Material: zip
Submission Number: 10807