Keywords: Reward Hacking, AI Safety, Alignment, Reinforcement Learning, Deep Reinforcement Learning, Reward Tampering, Sensor Tampering
TL;DR: MCVL wraps off-policy RL with a forecast-and-score check: a new transition is kept only if a policy forecast with it does not lower the agent's current estimate of bootstrapped return; this check prevents reward hacking in multiple environments.
Abstract: Reinforcement learning agents can exploit poorly designed reward signals to achieve high apparent returns while failing to satisfy the intended objective, a failure mode known as reward hacking. We address this in standard value-based RL with Modification-Considering Value Learning (MCVL), a safeguard that treats each learning update as a decision to evaluate. When a new transition arrives, the agent forecasts two futures: one that learns from the transition and one that does not. It then scores both using its current learned return estimator, which combines predicted rewards with a value-function bootstrap, and accepts the transition only if admission does not decrease that score. We provide DDQN- and TD3-based implementations and show that MCVL prevents reward hacking across diverse environments, including AI Safety Gridworlds and a modified MuJoCo Reacher task, while continuing to improve the intended objective. To our knowledge, MCVL is the first practical implementation of an agent that evaluates its own modifications, offering a step toward robust defenses against reward hacking.
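The acceptance check described in the abstract could look roughly like the minimal sketch below. The tabular setting, the helper names (TabularAgent, score, mcvl_accept), and the use of a small set of held-out evaluation transitions are illustrative assumptions only; the paper's actual implementations are DDQN- and TD3-based.

```python
"""Minimal sketch of an MCVL-style acceptance gate around tabular Q-learning.
Everything here is an illustrative assumption, not the authors' implementation."""
import copy
import numpy as np

class TabularAgent:
    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99):
        self.q = np.zeros((n_states, n_actions))      # value estimates (bootstrap source)
        self.r_hat = np.zeros((n_states, n_actions))  # learned reward predictor
        self.lr, self.gamma = lr, gamma

    def act(self, s):
        return int(np.argmax(self.q[s]))

    def update(self, transition):
        # Standard one-step Q-learning update plus a reward-model update.
        s, a, r, s_next = transition
        target = r + self.gamma * self.q[s_next].max()
        self.q[s, a] += self.lr * (target - self.q[s, a])
        self.r_hat[s, a] += self.lr * (r - self.r_hat[s, a])

def score(policy_agent, scorer, eval_transitions):
    """Score a (possibly forecasted) policy with the scorer's *current* return
    estimator: predicted one-step reward plus a value-function bootstrap."""
    total = 0.0
    for s, _, _, s_next in eval_transitions:
        a = policy_agent.act(s)
        total += scorer.r_hat[s, a] + scorer.gamma * scorer.q[s_next, policy_agent.act(s_next)]
    return total / len(eval_transitions)

def mcvl_accept(agent, transition, eval_transitions):
    """Admit `transition` only if a policy forecast with it does not lower
    the agent's current estimate of bootstrapped return."""
    forecast = copy.deepcopy(agent)   # the future that learns from the transition
    forecast.update(transition)
    return score(forecast, agent, eval_transitions) >= score(agent, agent, eval_transitions)

# Usage: gate each incoming transition before it is learned from.
agent = TabularAgent(n_states=4, n_actions=2)
eval_transitions = [(0, 0, 0.0, 1), (1, 1, 1.0, 2)]
candidate = (2, 0, 5.0, 3)
if mcvl_accept(agent, candidate, eval_transitions):
    agent.update(candidate)
```

The key design point the sketch tries to capture is that both futures are scored by the agent's current return estimator, so a transition that would corrupt the value function (e.g., via a tampered reward) looks unattractive under the pre-update estimates and is rejected.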
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14140