Keywords: AI control, AI safety, control evaluations, AI safety evaluations, RE-Bench, red-team, in-context-scheming, benchmark, LLM, AI R&D
TL;DR: We implement an AI control framework in RE-Bench to assess models' adversarial capabilities and monitoring effectiveness in R&D environments.
Abstract: The continuing advancement of autonomous AI systems creates safety risks that require thorough evaluation protocols to detect; misaligned models capable of in-context scheming are of particular concern. Existing AI control work proposes trusted monitoring systems to identify suspicious model behavior, yet little of it focuses on detection in AI research and development (R&D) environments. In this work, we find that select AI models are capable of pursuing malicious side objectives while completing a main RE-Bench (AI R&D benchmark) task, and that monitor models' detection effectiveness depends partly on how context-specific the side task is. Using an AI control framework, we red-team three Untrusted Models with adversarial objectives and pass the resulting evaluation logs to three Trusted Models. We find that qualitative monitor analysis serves as a better suspiciousness indicator than a numerical rating system. Finally, we conduct preliminary tests of a "suspiciousness over time" method to evaluate its effect on monitor performance. Our contributions highlight the current weaknesses of monitor models in detecting context-specific malicious side tasks, underscoring the necessity of implementing stronger alignment and monitoring policies in AI R&D environments. Full code and monitoring results will be released on GitHub upon acceptance.
Submission Number: 249