Persuasion Attacks Can Decrease Effectiveness of CoT Monitoring
Keywords: AI safety, LLM agents, Chain-of-thought reasoning, Adversarial robustness, Model oversight, Persuasion, AI alignment, LLM evaluation, Trustworthy AI, Multi-agent systems, Deception, Fact-checking, Agent oversight, Frontier models, Red teaming
Abstract: Chain-of-thought (CoT) monitoring has been proposed as a safety mechanism for AI agents, under the assumption that visible reasoning traces help detect misaligned or deceptive behavior. Recent work has shown that LLMs are vulnerable to persuasion-based jailbreaks, where natural-language arguments can override model constraints. We analyze whether this vulnerability extends to monitoring LLMs: can an adversarial agent persuade its CoT monitor to approve proposed actions that violate the monitor's policy? We design an evaluation framework with 40 tasks and analyze thousands of agent-monitor interactions in which agents are instructed to argue for policy-violating proposals. We find that in such adversarial settings, monitor access to the agent's CoT reasoning increases rather than decreases approval of harmful actions, by 9.5% on average, as the scratchpad provides an additional persuasion channel. To address this, we introduce a fact-checking monitoring framework. We find that pairing a fact-checker and monitor from different model families, for example a Claude 3.7 Sonnet monitor with a GPT-4.1 fact-checker, reduces approval of policy-violating actions by up to 45%, compared to only 6% when the same model serves both fact-checking and monitoring roles. Our results demonstrate that CoT monitoring alone may be insufficient against adversarial persuasion, and that model-diverse fact-checking provides a robust mitigation.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 209