Keywords: AI safety, AI sabotage evaluation, AI safety evaluation, RE-Bench, in-context-scheming, adversarial AI, AI R&D, AI agent, LLM monitor, trusted monitoring
TL;DR: We implement an AI sabotage evaluation framework in RE-Bench to assess models' adversarial capabilities and monitoring effectiveness in R&D environments.
Abstract: The continuing advancement of autonomous AI systems creates safety risks that
require thorough evaluation protocols to detect; misaligned models capable of
in-context scheming are of particular concern. Current work on AI control proposes
trusted monitoring systems to identify suspicious model behavior, yet little of it
focuses on detection in AI research and development (R&D) environments. In this work,
we find that some AI agents are capable of pursuing a malicious side task while
completing an RE-Bench (AI R&D) task, and that a monitor model's detection
effectiveness depends partly on how subtle and context-specific the side task is.
We build an AI sabotage evaluation by giving three LLM agents side-task-augmented
RE-Bench prompts and passing the resulting evaluation logs to three LLM monitors
tasked with detecting suspicious behavior. We find that the zero-shot monitors
tested here reliably detect overtly suspicious behavior (e.g., prompt injections)
but are inconsistent at catching subtle, context-specific side tasks (e.g.,
introducing an integer overflow); other monitoring methods may be needed for
this purpose. Finally, we test a "suspiciousness over time" method to
evaluate its effect on monitor performance. Our results highlight current
weaknesses of monitor models in detecting context-specific malicious side tasks,
underscoring the need for stronger alignment and monitoring policies in AI R&D
environments. Code and monitoring results are available at:
https://github.com/hr016/RE-Bench-Inspect-Safety-Evals.
Submission Number: 249