Evaluating AI Agent Persuasion of Safety Monitors

Published: 22 Sept 2025, Last Modified: 03 Jan 2026
Venue: WiML @ NeurIPS 2025
License: CC BY 4.0
Keywords: Chain-of-Thought, Monitoring, Control, LLM, Persuasion, Alignment, Deceptive Alignment, Collusion, Safety, Agent-monitor interactions, Evaluation, Misalignment detection
Submission Number: 408