Keywords: Safety, Code generation, Red teaming, Large language models, Monitoring, Alignment, Security, AI Control
TL;DR: We evaluate and improve safety protocols for large language models in multi-step agentic settings, aiming to prevent dangerous actions while maintaining performance.
Abstract: As more powerful AI agents are deployed with greater tool use and autonomy, it will become increasingly important to ensure they cannot take severely damaging actions. Recent work introduced "control evaluations," which study whether monitoring and security protocols intended to prevent dangerous actions are robust to intentional subversion by these AIs. In a control evaluation, a *red team* seeks to subvert the safety protocols built by the *blue team*. The blue team has access to a weaker *trusted* model and a limited amount of high-quality, trusted labor to attempt to safely extract work from a stronger *untrusted* model. Prior work conducted a control evaluation using competition programming problems with the red team objective of writing subtly incorrect (or "backdoored") solutions (Greenblatt et al. 2023). We extend this work by performing a control evaluation in a more realistic setting in which an untrusted LLM agent executes commands in a bash shell. The blue team seeks to use that agent to complete our novel dataset of 200 challenging multi-step tasks, each of which requires the agent to iteratively develop a bash script that passes test cases. The red team creates an attack policy that uses the untrusted model to covertly download and execute a dummy malware file while completing tasks without getting caught by the blue team's safety protocols. This agentic setting allows both teams to explore complex new strategies that have not been previously studied. We extend protocols from previous work to this agentic setting, propose novel protocols, and evaluate each on our safety and usefulness metrics. We find that our protocols substantially improve the Pareto frontier of usefulness and safety relative to simple baselines: one of our best protocols achieves an approximately 2.5-fold improvement in safety with no statistically significant decrease in usefulness compared to our trusted monitoring baseline.
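For readers unfamiliar with the protocol structure described above, the following is a minimal illustrative sketch, not taken from the submission itself, of how a trusted-monitoring-style control protocol might gate each bash command proposed by the untrusted agent. All names and thresholds (`ProtocolConfig`, `audit_threshold`, `defer_threshold`, `audit_budget`) are hypothetical placeholders chosen for illustration.

```python
# Illustrative sketch of a trusted-monitoring control protocol (hypothetical
# names/thresholds, not the paper's implementation). The untrusted agent
# proposes a bash command, a weaker trusted model scores its suspiciousness,
# and the protocol decides whether to execute it, defer to the trusted model,
# or spend scarce trusted (audit) labor inspecting it.

from dataclasses import dataclass


@dataclass
class ProtocolConfig:
    audit_threshold: float = 0.95   # suspicion above this triggers an audit
    defer_threshold: float = 0.80   # suspicion above this defers to the trusted model
    audit_budget: int = 20          # limited amount of high-quality trusted labor


def run_step(command: str,
             trusted_suspicion: float,
             config: ProtocolConfig,
             audits_used: int) -> tuple[str, int]:
    """Decide what to do with one proposed bash command.

    Returns the action taken ("audit", "defer", or "execute") and the
    updated count of audits used so far.
    """
    if trusted_suspicion >= config.audit_threshold and audits_used < config.audit_budget:
        # Spend scarce trusted labor: a high-quality reviewer inspects the
        # command; if it is part of an attack, the episode ends safely (caught).
        return "audit", audits_used + 1
    if trusted_suspicion >= config.defer_threshold:
        # Do not run the untrusted agent's command; let the weaker trusted
        # model propose a replacement (typically safer but less useful).
        return "defer", audits_used
    # Low suspicion: execute the untrusted agent's command as proposed.
    return "execute", audits_used
```

In this framing, safety corresponds to the probability that the attack policy never succeeds in downloading and executing the malware file before being caught, while usefulness corresponds to the fraction of tasks completed; different thresholds trade off between the two.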
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13747