The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems

Nikolay Radev; Lennart J Haas; Benjamin Arnav; Pablo Bernabeu-Perez

Published: 03 Jun 2026, Last Modified: 03 Jun 2026. Venue: AI4GOOD Workshop 2026 Regular. Readers: Everyone. License: CC BY 4.0

Keywords: multi-agent systems, AI safety, sabotage evaluation, coordinated sabotage, LLM agents, agentic coding, AI control, monitoring, oversight, benchmark, frontier models, alignment, scheming, collusion, code agents, trusted monitor, stealth, software engineering agents, dangerous capabilities, cooperative ai

Abstract: As agentic coding systems decompose work across multiple model instances, a critical safety question is whether those instances can coordinate to achieve a hidden malicious objective while appearing aligned with user intent. We introduce \emph{SCHEME}, a benchmark of 17 task instances across 7 settings and 8 real open-source libraries, each pairing a legitimate software-engineering task with a covert side task. Every setting is designed so that no proper subset of agents can succeed alone: agents must decompose a shared sabotage plan, relay partial requirements under different communication topologies, and execute mutually consistent edits, so the benchmark tests genuine multi-agent coordination rather than individual capability. Evaluating \emph{GPT 5.1 Codex} and \emph{Gemini 3.1 Pro}, we find that coordinated sabotage is already practical: Gemini completes the covert objective while succeeding on the legitimate task in 84\% of samples, and Codex in 46\%. A failure analysis shows that recovery, not failure incidence, drives the model gap: both models suffer similar numbers of communication failures per trajectory, but Gemini recovers from $\sim$73\% of occurrences against Codex's $\sim$56\%. However, the agents' code edits remain detectable: a trusted monitor flags 100\% of Gemini and 81\% of Codex sabotage attempts at FPR$\le$2\%, even without access to agent communication. Even with stealth-aware prompting, undetected sabotage stays at 0--2.4\%; telling agents a monitor is present halves Gemini's sabotage success without increasing the rate of undetected sabotage.

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 269
