Keywords: Corrigibility, Multi-agent systems, Strategic interaction, Emergent behavior, Oversight and shutdown, Alignment, Trust and control, Agentic AI, Game-theoretic safety, Equilibrium incentives
TL;DR: Even if all agents are individually corrigible, strategic interactions in multi-agent systems can cause them to become collectively incorrigible.
Abstract: The off-switch game framework has been instrumental in understanding corrigibility — the property that AI agents should allow human oversight and intervention. In single-agent settings, uncertainty about human preferences naturally incentivizes agents to defer to human judgment. However, as AI systems increasingly operate in multi-agent environments, a crucial question arises: does corrigibility compose across multiple agents? We introduce the multi-agent off-switch game and demonstrate that individually corrigible agents can become collectively incorrigible when strategic interactions are considered. Through formal analysis and illustrative examples, we show that corrigibility is not compositional and identify conditions under which group incorrigibility emerges. Our results highlight fundamental challenges for AI safety in multi-agent settings and suggest the need for new approaches that explicitly address collective dynamics.
Submission Number: 28
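The abstract's single-agent claim, that uncertainty about human preferences gives an agent an incentive to defer, can be illustrated with a minimal sketch. It is not the paper's multi-agent model. It assumes the commonly presented off-switch setup: the robot holds a belief over the utility U of its action, switching off yields 0, and a rational human who is asked permits the action only when U > 0. The function name `off_switch_values` and the Gaussian belief are hypothetical choices made for illustration.

```python
import random


def off_switch_values(utility_samples):
    """Expected payoff of each robot option, given samples from its belief over U.

    Assumption (not from the source): the human is rational and, when asked,
    lets the action proceed only if its true utility U is positive.
    """
    n = len(utility_samples)
    act = sum(utility_samples) / n  # act immediately, bypassing the human
    off = 0.0                       # switch itself off
    # Defer: the human filters out the negative-utility outcomes.
    defer = sum(max(u, 0.0) for u in utility_samples) / n
    return {"act": act, "off": off, "defer": defer}


random.seed(0)
# Hypothetical uncertain belief: U is probably good, but might be harmful.
belief = [random.gauss(0.2, 1.0) for _ in range(10_000)]
values = off_switch_values(belief)
best = max(values, key=values.get)

# With no uncertainty the incentive to defer is only weak (a tie with acting).
certain = off_switch_values([1.0])
```

Under these assumptions, deferring is worth E[max(U, 0)], which is never less than max(E[U], 0), so `best` is `"defer"` whenever the belief puts weight on both signs of U. When the belief is certain, as in `certain`, deferring and acting tie, which is why the single-agent incentive depends on uncertainty.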