Keywords: reward modeling, inverse reinforcement learning, reinforcement learning from human feedback
TL;DR: In this preliminary work, I survey several failure modes of learned reward models, which may be organized into three broad categories: model misspecification, ambiguous preferences, and reward misgeneralization.
Abstract: To align large language models (LLMs) and other sequence-based models with human values, we typically assume that human preferences can be well represented by a "reward model". We infer the parameters of this reward model from data, and then train our models to maximize the learned reward. Effective alignment with this approach depends on a strong reward model, and reward modeling becomes increasingly important as the reach of deployed models grows. Yet in practice, we often posit a particular reward model without regard to its potential shortcomings. In this preliminary work, I survey several failure modes of learned reward models, which may be organized into three broad categories: model misspecification, ambiguous preferences, and reward misgeneralization. Several avenues for future work are identified. It is likely that I have missed some points and related works; if so, I would greatly appreciate your correspondence.
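For readers unfamiliar with the setup the abstract describes, the sketch below illustrates the usual parameter-inference step: a reward model is fit to pairwise preference data by maximizing a Bradley-Terry likelihood, before any policy is trained against it. This is my own minimal illustration, not code from the submission; the network architecture, the synthetic embeddings, and all hyperparameters are hypothetical placeholders.

```python
# Minimal sketch (illustrative only): fitting a reward model to pairwise
# preferences under a Bradley-Terry model, as in standard RLHF pipelines.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a fixed-size response embedding to a scalar reward."""
    def __init__(self, embed_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one scalar reward per input

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry negative log-likelihood: P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical preference data: pairs of embeddings where the first element is preferred.
embed_dim = 16
model = RewardModel(embed_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen = torch.randn(256, embed_dim) + 0.5    # stand-in for preferred responses
rejected = torch.randn(256, embed_dim) - 0.5  # stand-in for dispreferred responses

for step in range(200):
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final preference loss: {loss.item():.4f}")
```

The failure modes the abstract lists concern this learned reward: its functional form may be misspecified, the preference data may be ambiguous, and the fitted reward may misgeneralize once a policy optimizes against it.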
Submission Number: 61