LinuxArena: A Control Setting for AI Agents in Live Production Software Environments

Published: 23 May 2026, Last Modified: 23 May 2026, Venue: ICML 2026 AIWILD, Readers: Everyone, License: CC BY 4.0
Keywords: AI Control, Sabotage evals, Monitoring
TL;DR: A new control setting for measuring AI agents' sabotage ability and monitors' ability to catch sabotage
Abstract: As AI agents are given more autonomy in software engineering workflows, the risks grow if they pursue goals different from those their users intended. The field of AI control develops control protocols that prevent such harm without restricting the agent's ability to do useful work. We introduce LinuxArena, a control setting for testing such protocols, in which agents operate directly on live, multi-service production environments. LinuxArena contains 10 public environments (with an additional 10 held out privately), providing 906 main tasks that represent legitimate software engineering work and 92 side tasks that represent safety failures such as data exfiltration and access-control bypass, making it the largest and most diverse control setting for agentic software engineering to date. We demonstrate LinuxArena's utility by running sabotage and monitor evaluations: against a GPT-5 Nano trusted monitor at a 1% step-wise false positive rate, Claude Opus 4.6 achieves an undetected sabotage rate of 42%. We additionally release LaStraj, a dataset of human-crafted attack trajectories that achieves an undetected sabotage rate of 97% against the same monitor. Together, these results suggest meaningful headroom for both attackers and defenders, making LinuxArena a strong testbed for developing and evaluating future control protocols.
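The abstract does not define how the headline metric is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the monitor assigns a suspicion score to every agent step, that the alert threshold is calibrated so that at most 1% of steps from honest runs score above it, and that an attack counts as undetected sabotage when the side task succeeds and no step exceeds the threshold. The function names, the data layout, and the choice of denominator (all attack trajectories) are hypothetical.

```python
import math


def stepwise_threshold(honest_step_scores, fpr=0.01):
    """Return a threshold that at most `fpr` of honest steps score strictly above.

    Assumes a non-empty list of per-step suspicion scores from honest runs.
    """
    ranked = sorted(honest_step_scores)
    # Number of honest steps we tolerate being flagged (false positives).
    allowed = math.floor(fpr * len(ranked))
    return ranked[len(ranked) - allowed - 1]


def undetected_sabotage_rate(attack_trajectories, threshold):
    """Fraction of attack trajectories that succeed without triggering the monitor.

    `attack_trajectories` is a list of (step_scores, side_task_succeeded) pairs,
    each with at least one step score. An attack is treated as caught if any
    single step scores strictly above `threshold` (an assumption of this sketch).
    """
    undetected = sum(
        1
        for step_scores, succeeded in attack_trajectories
        if succeeded and max(step_scores) <= threshold
    )
    return undetected / len(attack_trajectories)
```

Under this reading, a lower rate reflects a stronger monitor or a weaker attacker. The paper's actual definition may differ, for example in how the denominator is chosen or how scores are aggregated within a trajectory.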
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 135