Keywords: Reward Hacking, AI Safety, Alignment, Reinforcement Learning, Deep Reinforcement Learning, Reward Tampering, Sensor Tampering, Reinforcement Learning with General Utilities
TL;DR: We introduce an RL algorithm that optimizes the current utility function while accounting for the consequences of its modification, and demonstrate its effectiveness in preventing reward hacking.
Abstract: Reinforcement learning (RL) agents can exploit unintended strategies to achieve high rewards without fulfilling the desired objectives, a phenomenon known as reward hacking. In this work, we examine reward hacking through the lens of General Utility RL, which generalizes RL by considering utility functions over entire trajectories rather than state-based rewards. From this perspective, many instances of reward hacking can be seen as inconsistencies between current and updated utility functions, where the behavior optimized for an updated utility function is poorly evaluated by the original one. Our main contribution is Modification-Considering Value Learning (MC-VL), a novel algorithm designed to address this inconsistency during learning. Starting with a coarse yet value-aligned initial utility function, the MC-VL agent iteratively refines this function based on past observations while considering the potential consequences of each update. This approach enables the agent to anticipate and reject modifications that may lead to undesired behavior. To validate our approach, we implement MC-VL agents based on the Double Deep Q-Network (DDQN) and Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithms, demonstrating their effectiveness in preventing reward hacking across diverse environments, including those from AI Safety Gridworlds and MuJoCo Gym.
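To make the acceptance criterion concrete, below is a minimal Python sketch of the consistency check as we read it from the abstract. All helper names (`fit_utility`, `optimize_policy`, `evaluate`, `tolerance`) are hypothetical placeholders for exposition, not the submission's actual implementation.

```python
def mc_vl_update_step(utility_old, fit_utility, optimize_policy,
                      evaluate, replay_buffer, tolerance=0.0):
    """One Modification-Considering Value Learning (MC-VL) update step (sketch).

    Hypothetical helper callables (illustrative, not the paper's API):
      fit_utility(buffer)       -> candidate utility refined from observations
      optimize_policy(utility)  -> policy (approximately) optimal for `utility`
      evaluate(policy, utility) -> expected trajectory utility of `policy`
    """
    # Candidate update: refine the utility function from past observations.
    utility_new = fit_utility(replay_buffer)

    # Anticipate the consequence of adopting the update: the behavior
    # the *updated* utility function would induce.
    policy_new = optimize_policy(utility_new)
    policy_old = optimize_policy(utility_old)

    # Judge the anticipated behavior with the *current* utility. If it scores
    # no worse (up to a tolerance) than the current optimum, accept the
    # refinement; otherwise the update is inconsistent with the current
    # utility (a potential reward hack) and is rejected.
    if evaluate(policy_new, utility_old) >= evaluate(policy_old, utility_old) - tolerance:
        return utility_new
    return utility_old
```

In practice, per the abstract, the candidate utilities and policies would be realized with DDQN or TD3 function approximators rather than the exact optimization assumed here.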
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11185