Adversarial Behavior in Research Settings: Conducting Sabotage Evaluations with RE-Bench

Published: 24 Sept 2025, Last Modified: 01 Dec 2025NeurIPS 2025 LLM Evaluation Workshop PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: AI safety, AI sabotage evaluation, AI safety evaluation, RE-Bench, in-context-scheming, adversarial AI, AI R&D, AI agent, LLM monitor, trusted monitoring
TL;DR: We implement an AI sabotage evaluation framework in RE-Bench to assess models' adversarial capabilities and monitoring effectiveness in R&D environments.
Abstract: The continuing advancement of autonomous AI systems creates safety risks that require thorough evaluation protocols to detect; misaligned models capable of in-context scheming are of particular concern. Current AI control works propose trusted monitoring systems to identify suspicious model behavior, yet few focus on detection in an AI research and development (R&D) environment. In this work, we find that select AI agents are capable of pursuing a malicious side task while completing an RE-Bench (AI R&D) task, and that monitor models’ detection effectiveness depends partly on the subtlety and context-specificity of the side task. We create an AI sabotage evaluation by providing three LLM agents with side-task- augmented RE-Bench prompts, passing evaluation logs to three LLM monitors to detect suspicious behavior. We find that the zero-shot monitors tested here are consistently reliable in detecting generally suspicious behavior (e.g. prompt injections), but not consistently reliable in picking up subtle, contextual side tasks (e.g. integer overflow); other monitoring methods may need to be developed for this purpose. Finally, we conduct tests of a "suspiciousness over time" method to evaluate its effect on monitor performance. Our contributions highlight the current weaknesses of monitoring models in detecting context-specific malicious side tasks, underscoring the necessity of implementing stronger alignment and monitoring policies in AI R&D environments. Code and monitoring results available at: https://github.com/hr016/RE-Bench-Inspect-Safety-Evals.
Submission Number: 249
Loading