Enhancing Monitoring in AI Control through Trusted Debate

Published: 08 Nov 2025, Last Modified: 27 Nov 2025ResponsibleFM @ NeurIPS 2025EveryoneRevisionsBibTeXCC BY 4.0
Keywords: ai safety, ai control, debate, scalable oversight, trusted monitoring, large language models, natural language processing
TL;DR: Trusted debate is a control protocol where two trusted agents argue opposing views about code safety under a judge’s evaluation to expose hidden vulnerabilities that standard monitoring might miss.
Abstract: As increasingly capable AI systems are deployed in high-stakes settings, ensuring robust oversight in the face of intentional subversion has become a central challenge for AI safety. The AI control paradigm develops protocols that enable safe usage of advanced AI systems, even in scenarios where those systems may pursue objectives misaligned with human intentions and actively attempt to subvert control. In this paper, we build upon trusted monitoring, a protocol where a harmless, less capable model oversees the actions of a more powerful, untrusted model, by extending it with adversarial debate. Debate is a scalable oversight technique in which multiple agents contest opposing viewpoints, allowing the dynamics between adversaries to surface subtle errors, deceptions or risks that may elude traditional monitoring. Our control protocol, trusted debate, involves two trusted agents engaging in structured adversarial dialogue, where one argues that a piece of code contains a backdoor while the other defends the validity of the code. Their exchange is then evaluated by a trusted judge, who produces a suspicion score capturing the likelihood of malicious functionality in the code. By exploiting adversarial dynamics, trusted debate aims to surface hidden vulnerabilities that might evade direct inspection, strengthening the monitor’s detection ability. We compare trusted debate to trusted monitoring and trusted consultancy, where an individual debater introduces arguments in favor of their assigned side with no opposition. Finally, we discuss the effectiveness, limitations, and integration of trusted debate within broader AI control strategies.
Submission Number: 142
Loading