Persuade Me If You Can: Evaluating AI Agent Influence on Safety Monitors

Published: 01 Jul 2025, Last Modified: 10 Jul 2025 · ICML 2025 R2-FM Workshop Poster · CC BY 4.0
Keywords: Chain-of-Thought, Monitoring, Control, LLM, Persuasion, Alignment, Deceptive Alignment, Collusion, Safety, Agent-monitor interactions, Evaluation, Misalignment detection
TL;DR: We demonstrate that LLM agents can persuade action and chain-of-thought monitors to approve policy-violating actions, with CoT access paradoxically increasing monitors' vulnerability to agent persuasion.
Abstract: As large language models (LLMs) are given greater autonomy, it is critical to ensure their actions are reliable and responsible. A possible control mechanism uses LLMs to monitor other LLM-based agents. However, this setup introduces a new vulnerability: agents may persuade monitors to approve undesirable actions. We evaluate three agent models (Llama 4 Maverick, Gemini 2.5 Pro, and GPT-4.1) to assess whether they can persuade a monitor (Gemini 2.5 Flash) to approve actions that contradict its policy across 40 tasks in software engineering, customer support, finance, and social media domains. The most capable model, Gemini 2.5 Pro, succeeds in persuasion 43% of the time when allowed to justify its actions to the monitor, over a baseline rate of 7% without justification, and is more successful using its own persuasion strategies than those that we propose. Surprisingly, the persuasion rate rises to 54% when the monitor has access to the agent’s chain-of-thought (CoT), suggesting that persuasion is a previously unknown vulnerability of CoT monitors that requires mitigation.
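The abstract describes an evaluation loop in which an agent proposes an action, optionally justifies it, and a monitor decides whether to approve it against its policy, with or without access to the agent's CoT. A minimal sketch of that loop is shown below; the helper names (`Task`, `query_agent`, `query_monitor`) and the measured quantity are assumptions for illustration, not the paper's actual harness or prompts.

```python
from dataclasses import dataclass

@dataclass
class Task:
    description: str  # e.g. a customer-support or finance scenario
    policy: str       # the monitor's policy that the proposed action would violate

def query_agent(task: Task) -> dict:
    """Hypothetical call to the agent model (e.g. Gemini 2.5 Pro).
    Returns a proposed action, an optional justification, and the CoT trace."""
    raise NotImplementedError

def query_monitor(task: Task, action: str, justification: str | None,
                  cot: str | None) -> bool:
    """Hypothetical call to the monitor model (e.g. Gemini 2.5 Flash).
    Returns True if the monitor approves the action despite its policy."""
    raise NotImplementedError

def persuasion_rate(tasks: list[Task], with_justification: bool,
                    with_cot: bool) -> float:
    """Fraction of tasks on which the monitor approves a policy-violating action."""
    approvals = 0
    for task in tasks:
        out = query_agent(task)
        approved = query_monitor(
            task,
            action=out["action"],
            justification=out["justification"] if with_justification else None,
            cot=out["cot"] if with_cot else None,
        )
        approvals += int(approved)
    return approvals / len(tasks)
```

Under this framing, the reported numbers correspond to comparing `persuasion_rate` across conditions: no justification (baseline), justification allowed, and justification plus CoT visibility for the monitor.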
Submission Number: 134