Analyzing Reward Functions via Trajectory Alignment

Published: 10 Oct 2024, Last Modified: 29 Oct 2024 · NeurIPS 2024 Workshop on Behavioral ML · CC BY 4.0
Keywords: Reinforcement Learning, Reward Alignment
Abstract: Reward design in reinforcement learning (RL) is often overlooked, with the assumption that a well-defined reward is readily available. However, reward functions can be challenging to design and prone to reward hacking, potentially leading to unintended or dangerous consequences in real-world applications. To create safe RL agents, reward alignment is crucial. We define reward alignment as the process of designing reward functions that preserve the preferences of a human stakeholder. In practice, reward functions are designed with training performance as the primary measure of success; this measure, however, may not reflect alignment. This work studies the practical implications of reward design on alignment. Specifically, we (1) propose a reward alignment metric, the Trajectory Alignment coefficient, that measures the similarity between the preference orderings of a human stakeholder and the preference orderings induced by a reward function, (2) use this metric to quantify the prevalence and extent of misalignment in human-designed reward functions, and (3) examine how misalignment affects the efficacy of these human-designed reward functions in terms of training performance.
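The Trajectory Alignment coefficient described in point (1) compares two preference orderings over trajectories: the stakeholder's and the one induced by the designed reward's returns. The abstract does not give the exact formula, but a rank-agreement statistic over trajectory pairs captures the idea. Below is a minimal, hypothetical Python sketch assuming a Kendall-tau-style count of concordant versus discordant pairs; the function name, inputs, and tie handling are illustrative assumptions, not the paper's definition.

```python
import itertools
import numpy as np

def trajectory_alignment_coefficient(human_prefs, reward_returns):
    """Hypothetical sketch of a trajectory-alignment-style score.

    human_prefs: dict mapping a trajectory-index pair (i, j) to the index
        (i or j) that the human stakeholder prefers.
    reward_returns: array of cumulative returns each trajectory receives
        under the designed reward function.

    Returns a value in [-1, 1]: +1 if the reward-induced ordering agrees
    with every human preference, -1 if it reverses every one.
    """
    concordant, discordant = 0, 0
    for (i, j), preferred in human_prefs.items():
        if reward_returns[i] == reward_returns[j]:
            continue  # ties under the reward counted as neither (an assumption)
        # Which trajectory the reward function ranks higher on this pair.
        reward_prefers = i if reward_returns[i] > reward_returns[j] else j
        if reward_prefers == preferred:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return 0.0 if total == 0 else (concordant - discordant) / total


# Toy usage: 4 trajectories; the human prefers higher-indexed trajectories,
# but the reward function swaps trajectories 1 and 2.
human_prefs = {(i, j): j for i, j in itertools.combinations(range(4), 2)}
returns = np.array([0.1, 0.5, 0.4, 0.9])
print(trajectory_alignment_coefficient(human_prefs, returns))  # ~0.667
```

In this sketch, a coefficient well below 1 flags a reward function whose induced ranking of trajectories departs from the stakeholder's preferences, even if that reward still trains well, which is the kind of misalignment points (2) and (3) of the abstract set out to measure.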
Submission Number: 17