Deceive, Detect, and Disclose: Large Language Models Playing Mini-Mafia

Davi Bastos Costa; Renato Vicente

Deceive, Detect, and Disclose: Large Language Models Playing Mini-Mafia

Davi Bastos Costa, Renato Vicente

19 Sept 2025 (modified: 11 Feb 2026)Submitted to ICLR 2026EveryoneRevisionsBibTeXCC BY 4.0

Keywords: large language models, multi-agent, benchmark, deception, social intelligence

TL;DR: We introduce Mini-Mafia, a benchmark where LLMs play a social deduction game to evaluate their deception, detection, and disclosure skills in multi-agent settings.

Abstract: Mafia is a social deduction game where informed mafia compete against uninformed townsfolk. Its asymmetry of information and reliance on theory-of-mind reasoning mirror real-world multi-agent scenarios, making it a useful testbed for evaluating the social intelligence of large language models (LLMs). To support a systematic study, we introduce *Mini-Mafia*: a simplified four-player variant with one mafioso, one detective and two villagers. We set the mafioso to kill a villager and the detective to investigate the mafioso during the night, reducing the game to a single day phase of discussion and voting. Remarkably, we find that the mafia win-rate $p$ in this three-agent system can be described by a simple theoretical model: $\text{logit}(p) = v \times (m - d)$, where $m$, $d$, and $v$ are intrinsic model parameters representing the mafioso deceive, the villager detection , and the detective disclosure capabilities, respectively. This model successfully predict any game combination outcome from intrinsic model parameters. Estimating these parameters from LLM gameplay data using Bayesian inference creates the *Mini-Mafia Benchmark*. Our experiments reveal counterintuitive results, including cases where smaller models significantly outperform larger ones. We also establish human baselines performance, revealing that LLMs excel at persuasive communication but lag in strategic reasoning for agentic interaction. Beyond benchmarking, Mini-Mafia enables quantitative study of emergent multi-agent dynamics such as name bias and last-speaker advantage, and contributes to AI safety by generating training data for deception detectors.

Primary Area: datasets and benchmarks

Submission Number: 19212

Loading