TL;DR: We introduce an RL algorithm that optimizes the current utility function while accounting for the consequences of its modification, and demonstrate its effectiveness in preventing reward hacking.
Abstract: Reinforcement learning (RL) agents can exploit unintended strategies to achieve high rewards without fulfilling the desired objectives, a phenomenon known as reward hacking. In this work, we examine reward hacking through the lens of General Utility RL, which generalizes RL by considering utility functions over entire trajectories rather than state-based rewards. From this perspective, many instances of reward hacking can be seen as inconsistencies between current and updated utility functions, where the behavior optimized for an updated utility function is evaluated poorly by the current one. Our main contribution is Modification-Considering Value Learning (MCVL), a novel algorithm designed to avoid this inconsistency during learning. Starting from a coarse but aligned initial utility function, the MCVL agent iteratively refines this function while considering the potential consequences of each update. We implement MCVL agents based on DDQN and TD3 and demonstrate their effectiveness in preventing reward hacking in diverse environments, including those from AI Safety Gridworlds and MuJoCo Gym.
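To make the update rule described in the abstract concrete, below is a minimal Python sketch of the modification-considering check: a candidate utility update is accepted only if the behavior optimized for it is not evaluated worse by the current utility than the current utility's own optimized behavior. All names here (`Utility`, `optimize_policy`, `accept_update`, `rollout`) are hypothetical illustrations of the idea, not the paper's implementation, which builds on DDQN and TD3.

```python
# Minimal sketch of the consistency check described in the abstract.
# All names (Utility, optimize_policy, accept_update, rollout) are hypothetical;
# this illustrates the idea only, not the authors' DDQN/TD3 implementation.
from typing import Any, Callable, Sequence, Tuple

Trajectory = Sequence[Tuple[Any, Any]]   # sequence of (state, action) pairs
Utility = Callable[[Trajectory], float]  # general utility: scores a whole trajectory
Policy = Callable[[Any], Any]            # maps a state to an action


def rollout(policy: Policy, env_step: Callable[[Any, Any], Any],
            init_state: Any, horizon: int) -> Trajectory:
    """Collect one trajectory by running `policy` in a (hypothetical) environment."""
    traj, state = [], init_state
    for _ in range(horizon):
        action = policy(state)
        traj.append((state, action))
        state = env_step(state, action)
    return traj


def accept_update(current_u: Utility,
                  candidate_u: Utility,
                  optimize_policy: Callable[[Utility], Policy],
                  env_step: Callable[[Any, Any], Any],
                  init_state: Any,
                  horizon: int,
                  tolerance: float = 0.0) -> bool:
    """Accept the candidate utility only if the behavior optimized for it is not
    evaluated worse (beyond `tolerance`) by the *current* utility than the
    behavior optimized for the current utility itself -- the inconsistency the
    abstract associates with reward hacking."""
    behavior_current = rollout(optimize_policy(current_u), env_step, init_state, horizon)
    behavior_candidate = rollout(optimize_policy(candidate_u), env_step, init_state, horizon)
    return current_u(behavior_candidate) >= current_u(behavior_current) - tolerance
```

In an iterative learning loop of this kind, the agent would commit the candidate utility only when the check passes and otherwise retain the current one; how MCVL realizes this with learned value functions is specified in the paper itself.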
Primary Area: Social Aspects->Alignment
Keywords: Reward Hacking, AI Safety, Alignment, Reinforcement Learning, Deep Reinforcement Learning, Reward Tampering, Sensor Tampering, Reinforcement Learning with General Utilities
Submission Number: 7513