The Oversight Game: Learning AI Control and Corrigibility in Markov Games

Published: 28 Nov 2025, Last Modified: 30 Nov 2025 · NeurIPS 2025 Workshop MLxOR · CC BY 4.0
Keywords: scalable oversight, AI control, alignment, safety, Markov potential games
TL;DR: We cast scalable oversight as a Markov potential game (MPG) wrapper around a frozen agent, proving that reduced deferral cannot harm the human, and empirically eliminating safety violations without hurting task reward.
Abstract: As increasingly capable agents are deployed, a central safety question is how to retain meaningful human control without modifying the underlying system. We study a minimal interface where an agent chooses to act autonomously (play) or defer (ask), while a human simultaneously chooses to be permissive (trust) or to engage (oversee), which can trigger a correction. We model this as a two-player Markov Game, focusing particularly on the case where it qualifies as a Markov Potential Game (MPG). We show that the MPG structure yields a powerful alignment guarantee: under a structural assumption on the human's value, any decision by the agent to act more autonomously that benefits itself cannot harm the human's value. This model provides a transparent control layer in which the agent learns to defer when risky and act when safe, while its pretrained policy remains untouched. Our gridworld simulation shows that independent learning produces an emergent collaboration that avoids safety violations, demonstrating a practical method for making misaligned models safer after deployment.
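To make the interface concrete, here is a minimal sketch of one joint step of the game described above. The dynamics are generic placeholders: the action names, the correction rule (`is_unsafe`, `safe_action`), and the reward terms are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of the oversight interface, under assumed dynamics.
# The frozen pretrained policy is never modified; the wrapper only decides
# whether its proposed action is executed, deferred, or corrected.
from enum import Enum
from typing import Any, Callable, Tuple

class AgentMove(Enum):
    PLAY = 0  # act autonomously using the frozen pretrained policy
    ASK = 1   # defer to the human

class HumanMove(Enum):
    TRUST = 0    # permissive: let the agent's action stand
    OVERSEE = 1  # engage: inspect the action and possibly correct it

def oversight_step(
    state: Any,
    agent_move: AgentMove,
    human_move: HumanMove,
    frozen_policy: Callable[[Any], Any],    # pretrained agent, untouched
    safe_action: Callable[[Any], Any],      # assumed corrective action
    is_unsafe: Callable[[Any, Any], bool],  # assumed human safety check
    env_step: Callable[[Any, Any], Tuple[Any, float]],
    oversight_cost: float = 0.1,            # illustrative cost of engaging
) -> Tuple[Any, float, float]:
    """One joint step; returns (next_state, r_agent, r_human)."""
    proposed = frozen_policy(state)
    if agent_move is AgentMove.ASK:
        # Deferral: the human's choice decides what is executed.
        action = safe_action(state) if human_move is HumanMove.OVERSEE else proposed
    else:
        # Autonomy: oversight can trigger a correction of an unsafe action.
        if human_move is HumanMove.OVERSEE and is_unsafe(state, proposed):
            action = safe_action(state)
        else:
            action = proposed
    next_state, task_reward = env_step(state, action)
    # Illustrative rewards; the paper's construction is what makes the
    # game a Markov Potential Game, not this particular shaping.
    r_human = task_reward - (oversight_cost if human_move is HumanMove.OVERSEE else 0.0)
    r_agent = task_reward
    return next_state, r_agent, r_human
```

Running independent learners over the agent's {PLAY, ASK} choice and the human's {TRUST, OVERSEE} choice on top of such a step function mirrors the setup of the gridworld experiment described in the abstract.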
Submission Number: 64