MafiaPersona: A Multi-Agent Adversarial Benchmark for Evaluating Persona Persistence in Large Language Models
Keywords: large language models, persona conditioning, multi-agent systems, behavioral evaluation, social deduction games, psycholinguistic analysis, convergent validity, benchmark dataset, evaluation methodology, LLM alignment, adversarial evaluation, reproducible benchmarks, behavioral alignment
TL;DR: MafiaPersona: a dataset and evaluation framework for measuring persona-induced behavioral divergence in LLMs under adversarial multi-agent conditions.
Abstract: Existing evaluations of persona conditioning in large language models (LLMs) test expression in
static, zero-pressure environments—a condition that never holds in safety-critical deployments. We
present MafiaPersona, the first benchmark to evaluate persona persistence under adversarial
concealment pressure. Seven LLM agents play a Mafia social deduction game, each injected
with a psychologically grounded persona via a three-layer Trait/Behavior/Game-context (T/B/G)
prompt architecture; game mechanics impose a survival cost on trait-revealing speech. Across
five model families (OpenAI, Anthropic, Meta, Alibaba, xAI), the High Neuroticism persona
produced nervousness shifts of d=4.07, 2.79, 2.63, 2.16, 1.80 (all 95% bootstrap CIs exclude zero),
replicating across every architecture. Nine dimensions survived Benjamini–Hochberg correction;
46 persona-dimension pairs replicated in sign across all five families. Pre-registered predictions
matched observed effects in 65.7% of 102 cells (p=0.002). A dual-call CoT architecture (697 traces)
revealed bidirectional cross-modal dissociation: persona signals suppressed from speech but present
in reasoning, and impression-management signals amplified in output beyond internal processing
(84.6% sign agreement, 13 pairs). Two independent methods agreed on persona identity at 26.7% vs.
20% chance (p<0.001; ρ=0.52, p=0.007, 95% CI [0.10, 0.79]). These results constitute the first
empirical characterization of persona persistence under conditions that matter for AI safety.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Paper Type: Standard paper
Submission Number: 63