MafiaPersona: A Multi-Agent Adversarial Benchmark for Evaluating Persona Persistence in Large Language Models

Ojaswi Prakash; Dhruv Kumar; Murari Mandal; Mohan Kankanhalli; Yash Sinha

Published: 07 Jun 2026, Last Modified: 07 Jun 2026

Venue: ICML 2026 Workshop

License: CC BY 4.0

Keywords: large language models, persona conditioning, multi-agent systems, behavioral evaluation, social deduction games, psycholinguistic analysis, convergent validity, benchmark dataset, evaluation methodology, LLM alignment, adversarial evaluation, reproducible benchmarks, behavioral alignment

TL;DR: MafiaPersona: a dataset and evaluation framework for measuring persona-induced behavioural divergence in LLMs under adversarial multi-agent conditions.

Abstract: Existing evaluations of persona conditioning in large language models (LLMs) test expression in static, zero-pressure environments, a condition that never holds in safety-critical deployments. We present MafiaPersona, the first benchmark to evaluate persona persistence under adversarial concealment pressure. Seven LLM agents play a Mafia social deduction game, each injected with a psychologically grounded persona via a three-layer Trait/Behavior/Game-context (T/B/G) prompt architecture; game mechanics impose a survival cost on trait-revealing speech. Across five model families (OpenAI, Anthropic, Meta, Alibaba, xAI), the High Neuroticism persona produced nervousness shifts of d=4.07, 2.79, 2.63, 2.16, 1.80 (all 95% bootstrap CIs exclude zero), replicating across every architecture. Nine dimensions survived Benjamini–Hochberg correction; 46 persona-dimension pairs replicated in sign across all five families. Pre-registered predictions matched observed effects in 65.7% of 102 cells (p=0.002). A dual-call CoT architecture (697 traces) revealed bidirectional cross-modal dissociation: persona signals were suppressed from speech but present in reasoning, and impression-management signals were amplified in output beyond internal processing (84.6% sign agreement, 13 pairs). Two independent methods agreed on persona identity at 26.7% vs. 20% chance (p<0.001; ρ=0.52, p=0.007, 95% CI [0.10, 0.79]). These results constitute the first empirical characterization of persona persistence under conditions that matter for AI safety.
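The abstract reports Cohen's d effect sizes with 95% bootstrap confidence intervals and Benjamini–Hochberg correction. The sketch below illustrates those three standard procedures in plain Python. It is not the authors' analysis pipeline: the function names (`cohens_d`, `bootstrap_ci`, `benjamini_hochberg`), the pooled-variance form of d, the percentile bootstrap, and the default resample count are all assumptions made for illustration.

```python
import random
import statistics


def cohens_d(treated, control):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    n1, n2 = len(treated), len(control)
    v1, v2 = statistics.variance(treated), statistics.variance(control)
    pooled = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(treated) - statistics.mean(control)) / pooled


def bootstrap_ci(treated, control, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Cohen's d (hypothetical defaults)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample each group independently, with replacement.
        t = rng.choices(treated, k=len(treated))
        c = rng.choices(control, k=len(control))
        try:
            estimates.append(cohens_d(t, c))
        except ZeroDivisionError:
            continue  # degenerate resample with zero pooled variance
    estimates.sort()
    lo = estimates[int((alpha / 2) * len(estimates))]
    hi = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lo, hi


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * q.
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            max_k = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            rejected[i] = True
    return rejected
```

A confidence interval "excludes zero" in the abstract's sense when both values returned by `bootstrap_ci` have the same sign. A dimension "survives" correction when its entry in the list returned by `benjamini_hochberg` is `True`.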

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Paper Type: Standard paper

Submission Number: 63
