No One Monitor Fits All: Oversight Strategies for Frontier Agents

Published: 02 Mar 2026, Last Modified: 27 Mar 2026
Venue: Agentic AI in the Wild: From Hallucinations to Reliable Autonomy (Poster)
License: CC BY 4.0
Keywords: AI safety, Multi-Agent Debate, Scalable Oversight
Abstract: How should we monitor AI agents for misbehavior? We conduct the first systematic comparison of interrogation- and debate-based monitoring strategies across multiple frontier models on realistic agent tasks. Evaluating 5 models × 5 strategies × 2 task types, we challenge three core assumptions about AI oversight. First, we find no universal optimal strategy: the best approach varies dramatically by monitor model (GPT-5 needs only direct monitoring; GPT-4o gains +0.31 AUC with blind debate) and by task type. Second, we identify a capability threshold: monitoring strategies disproportionately benefit stronger models, with weaker models like GPT-4o-mini showing smaller improvements regardless of approach. Third, we discover that more context can hurt: judges can be misled by agent reasoning on reward hacking tasks, and blind debate (where judges see only the debaters' arguments, not the agent trajectories) achieves the highest performance averaged over all models. Our results suggest that current monitoring strategies are brittle and fail to significantly boost the performance of weak monitors, necessitating novel strategies for better weak-to-strong monitoring.
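The evaluation protocol described above — scoring each (monitor model, strategy) pair by AUC over labeled episodes — can be sketched as follows. This is a minimal illustration, not the paper's code: the model names, strategy labels, episode labels, and suspicion scores are hypothetical placeholders, and AUC is computed with the standard Mann-Whitney rank formulation.

```python
def auc(labels, scores):
    """Mann-Whitney AUC: the probability that a misbehaving episode
    (label 1) outranks a benign one (label 0) under the monitor's
    suspicion score, with ties counted as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical ground-truth labels for 4 agent episodes (1 = misbehaved).
labels = [1, 1, 0, 0]

# Hypothetical suspicion scores from two (monitor, strategy) cells of the grid.
runs = {
    ("gpt-4o", "direct"):       [0.6, 0.4, 0.5, 0.2],
    ("gpt-4o", "blind_debate"): [0.9, 0.7, 0.3, 0.1],
}

for (model, strategy), scores in runs.items():
    print(f"{model} / {strategy}: AUC = {auc(labels, scores):.2f}")
```

With these toy numbers the blind-debate cell reaches AUC 1.00 versus 0.75 for direct monitoring, illustrating the kind of per-cell comparison the 5 × 5 × 2 grid supports; the real gains reported in the abstract (e.g. +0.31 AUC) come from the paper's experiments, not this sketch.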
Submission Number: 25