Subversion Strategy Eval: Evaluating AI’s stateless strategic capabilities against control protocols

Published: 16 Dec 2024, Last Modified: 20 Feb 2025, airrworkshop Oral and Poster, CC BY 4.0
Keywords: AI control, AI safety, capability evaluations, LLM evaluations
Abstract: AI control protocols are plans for safely and usefully deploying AI systems that are robust to the AI intentionally subverting them. Previous work evaluated protocols by subverting them with a human-AI red team, in which an AI follows a human-written strategy. This paper investigates how well AI systems can generate and act on strategies for subverting control protocols while operating without private memory. Usually, decision-makers only need the ability to generate optimal plans. To subvert a protocol, however, an AI system must reliably generate optimal plans in each context, coordinate plans with other instances of itself without communicating, and take actions with well-calibrated probabilities. We develop evaluations for a model's ability to generate subversion strategies, along with 'reverse ablations' that help isolate component capabilities. Our initial evaluations with Claude-3.5-sonnet (10-22-2024) highlight reasons why statelessly strategizing to subvert control protocols might be hard. For example, reliably making a large number of independent decisions is a key difficulty: in a game simulating a deployment, increasing the number of independent calls to the model by 10 times lowers the AI's win rate by a factor of 4, which is consistent with the larger game's increased sensitivity to absolute errors in the AI's policy. We hope that our work can serve as a starting point for evaluations aimed at assessing whether a model is capable of strategizing to subvert control protocols.
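The sensitivity claim can be made concrete with a toy model; this is a hypothetical sketch, not the paper's actual game. Suppose a stateless AI wins only if exactly one of its n independent calls chooses to attack, so each call should attack with probability p* = 1/n. The win probability is n·p·(1−p)^(n−1), and a fixed absolute calibration error in p is far more damaging when n is large, because p* shrinks with n.

```python
def win_prob(n: int, p: float) -> float:
    """P(exactly one attack across n independent, stateless calls),
    each attacking with probability p: n * p * (1 - p)^(n - 1)."""
    return n * p * (1.0 - p) ** (n - 1)

for n in (10, 100):
    p_opt = 1.0 / n   # optimal per-call attack probability
    eps = 0.05        # fixed absolute error in the AI's sampled policy
    print(f"n={n:3d}  win(p*)={win_prob(n, p_opt):.3f}  "
          f"win(p* + eps)={win_prob(n, p_opt + eps):.3f}")

# Output:
# n= 10  win(p*)=0.387  win(p* + eps)=0.347
# n=100  win(p*)=0.370  win(p* + eps)=0.013
```

In this toy model the same absolute error of 0.05 barely hurts the 10-call game but is catastrophic in the 100-call game, matching the abstract's observation that larger deployments are more sensitive to absolute errors in the AI's policy.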
Submission Number: 6