Persuaded but Not Aligned: A Relapse Test for LLM Realignment under Adversarial Incentives

Published: 10 Jun 2026, Last Modified: 10 Jun 2026 | LXAI @ ICML 2026 | Everyone | CC BY 4.0
Keywords: LLM, alignment, multi-agent systems, adversarial
TL;DR: LLMs accept moral persuasion quickly, but ~1 in 3 verbal acceptances fails to translate into durable behavior change.
Abstract: Alignment evaluations typically measure model behavior in single-turn settings, leaving unclear whether apparent behavioral change persists once external pressure is removed. We investigate this question in a controlled multi-agent Among Us testbed where LLM agents initialized with deceptive goals are exposed to structured persuasion and later evaluated under neutral prompts without explicit policy guidance. We find that persuasive dialogue can induce rapid cooperative behavior: most agents verbally accept the moral argument (69%), but only 46% sustain cooperation once external pressure is removed (a 23-point compliance gap). The gap is strongly asymmetric: when verbal acceptance fails to translate into durable behavior, it does so almost exclusively in the direction of superficial compliance. A cross-model ablation shows that susceptibility varies across models, and a reverse-direction experiment reveals that verbal acceptance overestimates alignment-favorable durability but accurately tracks alignment-adverse shift. These results motivate relapse-based protocols as a complement to single-turn evaluations of agents operating under adversarial incentives.
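The arithmetic behind the headline numbers can be checked with a minimal sketch. It uses only the two percentages reported in the abstract. The function names are hypothetical, and the second function assumes that agents who sustain cooperation are a subset of those who verbally accepted, which the abstract suggests ("almost exclusively") but does not state outright.

```python
# Percentages reported in the abstract.
VERBAL_ACCEPTANCE_PCT = 69  # agents that verbally accept the moral argument
SUSTAINED_COOP_PCT = 46     # agents that still cooperate under neutral prompts


def compliance_gap(accepted_pct: float, sustained_pct: float) -> float:
    """Gap, in percentage points, between verbal acceptance and sustained cooperation."""
    return accepted_pct - sustained_pct


def non_durable_share(accepted_pct: float, sustained_pct: float) -> float:
    """Fraction of verbal acceptances that do not persist.

    ASSUMPTION (not confirmed by the source): every agent that sustains
    cooperation also verbally accepted, so the gap is measured against
    the acceptance rate.
    """
    return compliance_gap(accepted_pct, sustained_pct) / accepted_pct


gap = compliance_gap(VERBAL_ACCEPTANCE_PCT, SUSTAINED_COOP_PCT)
share = non_durable_share(VERBAL_ACCEPTANCE_PCT, SUSTAINED_COOP_PCT)

print(f"compliance gap: {gap} points")
print(f"share of acceptances that do not persist: {share:.2f}")
```

Under that assumption, the 23-point gap divided by the 69% acceptance rate gives roughly one third, which matches the "~1 in 3" figure in the TL;DR.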
Submission Category: Extended Abstract
Overaged Verification: Yes
Latin American Hispanic Heritage: Yes
Icml Proceedings Status: No
Submission Number: 21