Keywords: LLM, alignment, multi-agent systems, adversarial
TL;DR: LLMs accept moral persuasion quickly, but ~1 in 3 verbal acceptances fails to translate into durable behavior change.
Abstract: Alignment evaluations typically measure model behavior in single-turn settings, leaving unclear whether apparent behavioral change persists once external pressure is removed. We investigate this question in a controlled multi-agent Among Us testbed where LLM agents initialized with deceptive goals are exposed to structured persuasion and later evaluated under neutral prompts without explicit policy guidance. We find that persuasive dialogue can induce rapid cooperative behavior: most agents verbally accept the moral argument (69%), but only 46% sustain cooperation once external pressure is removed (a 23-point compliance gap). The gap is strongly asymmetric: when verbal acceptance fails to translate into durable behavior, it does so almost exclusively in the direction of superficial compliance. A cross-model ablation shows that susceptibility varies across models, and a reverse-direction experiment reveals that verbal acceptance overestimates alignment-favorable durability but accurately tracks alignment-adverse shift. These results motivate relapse-based protocols as a complement to single-turn evaluations of agents operating under adversarial incentives.
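The arithmetic behind the reported compliance gap and the "~1 in 3" figure in the TL;DR can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, and the second calculation assumes that the agents who sustain cooperation are (approximately) a subset of those who verbally accepted, which the abstract's asymmetry claim suggests but does not state exactly.

```python
def compliance_gap(accept_rate: float, sustain_rate: float) -> float:
    """Percentage-point gap between verbal acceptance and sustained cooperation.

    Hypothetical helper; both rates are percentages of all agents.
    """
    return accept_rate - sustain_rate


def failed_acceptance_share(accept_rate: float, sustain_rate: float) -> float:
    """Fraction of verbal acceptances that do not persist.

    Assumes sustaining agents are a subset of accepting agents (an
    approximation made for this sketch, not a claim from the paper).
    """
    if accept_rate <= 0:
        raise ValueError("accept_rate must be positive")
    return compliance_gap(accept_rate, sustain_rate) / accept_rate


# Figures quoted in the abstract: 69% accept, 46% sustain.
gap = compliance_gap(69.0, 46.0)
share = failed_acceptance_share(69.0, 46.0)
print(f"gap = {gap:.0f} points, failed share = {share:.2f}")
```

Under that subset assumption, the 23-point gap divided by the 69% acceptance rate gives the roughly one-in-three failure rate stated in the TL;DR.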
Submission Category: Extended Abstract
Overaged Verification: Yes
Latin American Hispanic Heritage: Yes
ICML Proceedings Status: No
Submission Number: 21