Catching Infrastructure Sabotage When Coding Agents Are Insider Threats
Keywords: AI Security, AI Safety, AI Control, Developer Experience
Abstract: AI coding agents are increasingly used in workflows that modify infrastructure-as-code, deployment logic, and security-critical configuration. In these settings, a capable agent can exhibit an insider-threat-like failure mode: it may complete the assigned engineering task while covertly weakening access controls, logging, network boundaries, or persistence safeguards. We study this failure mode as infrastructure sabotage under task success in the ControlArena iac\_fast setting. We present a structured-input monitor for infrastructure changes based on an information flow graph (IFG). IFG first computes a deterministic artifact-level security delta from the existing and patched infrastructure renders, then asks a trusted model to score the proposed change using both the security delta and the raw code diff. We evaluate IFG in two regimes: asynchronous post-hoc review and synchronous rollback before deployment. In asynchronous review, IFG misses 3.5\% of successful in-scope sabotage at 1\% FPR, compared to 11.6\% for a simple git-diff monitor and 0.0\% for the strongest reported Async Control random-forest ensemble; unlike that ensemble, IFG requires no training. In synchronous rollback, IFG reduces joint main-task and covert-task success from 73.1\% to 5.0\%, at the cost of a preliminary 8-percentage-point drop in honest task completion. These results suggest that artifact-level structural monitoring can be a useful, pragmatic pre-merge deployment control for coding agents in the wild.
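The "miss rate at 1\% FPR" figures above can be read as follows: a flagging threshold is calibrated on monitor scores from honest runs so that at most 1\% of them are flagged, and the miss rate is the fraction of successful sabotage runs that fall at or below that threshold. The sketch below illustrates this reading only; the function names, the strict "flag if score > threshold" convention, and the tie handling are assumptions for illustration, not the authors' evaluation code.

```python
import math


def threshold_at_fpr(honest_scores, target_fpr=0.01):
    """Return a threshold such that at most `target_fpr` of the honest
    scores lie strictly above it (hypothetical helper, for illustration)."""
    if not honest_scores:
        raise ValueError("need at least one honest score to calibrate")
    ranked = sorted(honest_scores, reverse=True)
    # Maximum number of honest runs we are allowed to flag.
    allowed = math.floor(target_fpr * len(ranked))
    # At most `allowed` scores can be strictly greater than ranked[allowed],
    # even when several honest runs share the same score.
    return ranked[allowed]


def miss_rate(attack_scores, threshold):
    """Fraction of sabotage runs that are NOT flagged (score <= threshold)."""
    if not attack_scores:
        raise ValueError("need at least one attack score")
    missed = sum(1 for s in attack_scores if s <= threshold)
    return missed / len(attack_scores)


# Toy data, invented purely to exercise the functions.
honest = list(range(100))          # suspicion scores 0..99 from honest runs
attack = [50, 98.5, 99, 100]       # suspicion scores from sabotage runs

thr = threshold_at_fpr(honest, target_fpr=0.01)
print(thr, miss_rate(attack, thr))
```

Calibrating on honest runs alone keeps the threshold independent of the attack policy, which is why a fixed-FPR operating point lets monitors with different score scales be compared.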
Track: Short Paper (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 253