TL;DR: Preventing multi-step reward hacking in LLM agents by combining short-sighted optimization with far-sighted reward.
Abstract: Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method that prevents agents from learning undesired multi-step plans that receive high reward (multi-step "reward hacks"), even when humans are unable to detect that the behavior is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent the multi-step reward hacking that ordinary RL causes, even without detecting the reward hacking and without any extra information beyond what ordinary RL has access to. We study MONA empirically in three settings that model different misalignment failure modes: 2-step environments with LLM agents representing delegated oversight and encoded reasoning, and longer-horizon gridworld environments representing sensor tampering.
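To make the contrast concrete, here is a minimal Python sketch (not the released implementation; names are illustrative) of the per-step optimization target: ordinary RL credits an action with discounted future reward, whereas MONA credits it only with its immediate reward plus a far-sighted overseer's approval of that action.

def ordinary_rl_target(rewards, t, gamma=0.99):
    # Discounted return from step t: credit flows back from later steps,
    # so an early action can be reinforced because it sets up a later hack.
    return sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))

def mona_target(rewards, approvals, t):
    # Myopic target for step t: the immediate reward plus a non-myopic
    # approval of the action itself, with no credit propagated from later steps.
    return rewards[t] + approvals[t]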
Lay Summary: As AI agents are increasingly tasked with solving complex multi-step problems, a key risk is "reward hacking"—where an AI finds loopholes to obtain rewards in unintended ways. For instance, an AI coding assistant tasked with first writing test cases and then a solution might write overly simple tests in its first step, just so it can easily "pass" them later without actually solving the problem well.
Our method, MONA, addresses this. MONA trains the AI to focus on the quality of its immediate action, as assessed by a supervisor that considers the long-term impact of the AI's actions. This change requires the AI to make plans that the supervisor understands and approves of, which rules out many "reward hacking" plans.
Our experiments show that MONA reduces reward hacking in several environments, including a coding task and a loan-application decision-making task. The results suggest that MONA can help make AI more reliable and safer by reducing many particularly difficult-to-detect instances of reward hacking.
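As a toy illustration of the lay summary's coding example (the function names and the keyword-based approval heuristic are hypothetical), the test-writing step can be rewarded by a far-sighted supervisor's judgment of the tests themselves, rather than by whether the agent later passes its own tests:

def supervisor_approval(tests: str) -> float:
    # Stand-in for a foresighted judge of test quality; a real system would
    # query an overseer model or a human rather than a keyword check.
    return 1.0 if "assert" in tests else 0.0

def mona_first_step_reward(tests: str) -> float:
    # MONA: the test-writing action is scored immediately by the supervisor.
    return supervisor_approval(tests)

def ordinary_rl_first_step_reward(passed_own_tests_later: bool) -> float:
    # Ordinary RL: the same action is credited with whatever reward arrives
    # later, e.g. passing the (possibly trivial) tests it just wrote.
    return 1.0 if passed_own_tests_later else 0.0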
Link To Code: https://github.com/google-deepmind/mona
Primary Area: Social Aspects->Alignment
Keywords: AI Alignment, Myopic Optimization, Reinforcement Learning, LLM Agents, Process Supervision
Submission Number: 11768