Reward Model Underspecification in Language Model Alignment

Published: 28 Oct 2023, Last Modified: 02 Apr 2024, DistShift 2023 Poster
Keywords: alignment, reward models, underspecification, ensembles
TL;DR: Reward models must perform well OOD to be useful, but their OOD performance is underspecified: it varies significantly with the pretraining seed.
Abstract: Reward models play a key role in aligning language model applications towards human preferences. However, this setup can create a dynamic in which the policy model has an incentive to exploit errors in the reward model to achieve high reward. The success of reward-based alignment therefore depends on the ability of reward models to transfer to the new distributions created by the aligned policy model. We show that reward models are underspecified, in the sense that models that perform similarly in-distribution can yield very different rewards on policy model outputs. These differences propagate to the aligned policies, which we show to be heavily influenced by the random seed used during pretraining of the reward model. We show that even a simple alignment strategy, best-of-$n$ reranking, creates a semi-adversarial dynamic between the policy and reward models, promoting outputs on which the reward models are more likely to disagree. Finally, we show that a simple ensembling strategy can help to address this issue.
Submission Number: 65
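
To make the abstract's two mechanisms concrete, the following is a minimal sketch of best-of-$n$ reranking and of scoring candidates with an ensemble of reward models. It is an illustration under stated assumptions, not the authors' implementation: the toy reward functions, the names `best_of_n`, `ensemble_reward`, and `make_toy_reward`, and the aggregation rule (mean minus a penalty times the standard deviation) are all hypothetical, since the abstract does not say how the ensemble is combined.

```python
import statistics
from typing import Callable, Sequence

RewardFn = Callable[[str], float]


def best_of_n(candidates: Sequence[str], reward: RewardFn) -> str:
    """Best-of-n reranking: return the candidate with the highest reward."""
    return max(candidates, key=reward)


def ensemble_reward(members: Sequence[RewardFn], penalty: float = 0.0) -> RewardFn:
    """Combine reward functions into one score.

    Assumption: the score is mean - penalty * population std across members,
    so candidates on which the members disagree are scored lower.
    """
    def combined(text: str) -> float:
        scores = [member(text) for member in members]
        return statistics.fmean(scores) - penalty * statistics.pstdev(scores)
    return combined


def make_toy_reward(length_bias: float) -> RewardFn:
    """Hypothetical stand-in for a reward model.

    Every toy model agrees on a 'quality' signal but carries its own
    spurious preference for longer or shorter outputs, mimicking models
    that differ only in their pretraining seed.
    """
    def reward(text: str) -> float:
        quality = text.count("helpful")
        return quality + length_bias * len(text.split())
    return reward


# Three toy reward models with different spurious length preferences.
members = [make_toy_reward(0.3), make_toy_reward(-0.1), make_toy_reward(0.0)]

concise = "helpful answer"
padded = "helpful " + " ".join(["padding"] * 10)
candidates = [concise, padded, "short"]

# Individual reward models can select different outputs from the same pool.
picks_per_model = [best_of_n(candidates, member) for member in members]

# Reranking against a disagreement-penalized ensemble score.
conservative_pick = best_of_n(candidates, ensemble_reward(members, penalty=1.0))
```

In this toy setup the padded candidate is exactly the kind of output on which the reward models disagree most, so a single model that happens to favor length will select it, while the penalized ensemble score steers reranking away from it. The penalty weight of 1.0 is arbitrary and chosen only for illustration.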