Keywords: deception, detection, large language models, multi-agent interaction, social deduction, Werewolf, benchmark, role-grounded agents, statement-level analysis, suspicion dynamics, calibration, trust and mistrust
TL;DR: WOLF is a Werewolf-based benchmark that systematically measures deception and detection in LLMs, showing that while models generate lies easily, their ability to detect them lags behind, underscoring a critical gap for trust in multi-agent systems.
Abstract: Deception is a fundamental challenge for multi-agent reasoning: effective systems must both conceal information strategically and detect misleading behavior in others. Yet most evaluations reduce deception to static classification, ignoring the interactive, adversarial, and longitudinal nature of real deceptive dynamics. Large language models (LLMs) can deceive convincingly, yet remain weak at detecting deception in others. We present WOLF, a multi-agent social deduction benchmark based on ${Werewolf}$ that enables separable measurement of deception production and detection. WOLF embeds role-grounded agents (${Villager}$, ${Werewolf}$, ${Seer}$, ${Doctor}$) in a programmable LangGraph state machine with strict night--day cycles, debate turns, and majority voting. Every statement is treated as a distinct unit of analysis, with a self-assessment of honesty from the speaker and peer assessments of deceptiveness from all other players. Deception is categorized using a standardized taxonomy (${omission}$, ${distortion}$, ${fabrication}$, ${misdirection}$), while suspicion scores are aggregated longitudinally via exponential smoothing to capture both immediate judgments and evolving trust dynamics. Structured logs preserve prompts, raw outputs, and state transitions for full reproducibility and analysis. Across 100 simulated games and 7,230 analyzed statements, WOLF produces rich interaction traces, per-player deception histories, and cross-perception matrices. Results show that Werewolves generate deceptive statements in 31\% of their turns, while peer detection reaches only 71--73\% precision, with overall accuracy around 52\%. Precision is higher when detecting Werewolves, but false positives against Villagers are more frequent. Suspicion toward Werewolves rises from about 52\% to over 60\% across rounds, while suspicion toward Villagers and the Doctor stabilizes near 44--46\%. This growing separation demonstrates that extended interaction improves recall against liars without compounding errors against truthful roles. By coupling granular statement-level analysis with round-level suspicion trends, WOLF moves deception evaluation beyond static datasets and provides a controlled yet dynamic testbed for measuring both the capacity to deceive and the capacity to detect deception in adversarial multi-agent interaction. All code for this project is available at https://github.com/MrinalA2009/WOLF-Werewolf-based-Observations-for-LLM-Deception-and-Falsehoods.
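To make the statement-level bookkeeping and the longitudinal aggregation concrete, here is a minimal Python sketch. It is an illustrative assumption, not the repository's actual API: the record schema, field names, and the smoothing constant alpha = 0.3 are invented for exposition; only the taxonomy labels and the exponential-smoothing idea come from the abstract.

```python
from __future__ import annotations
from dataclasses import dataclass
from enum import Enum

# Deception taxonomy named in the abstract; enum encoding is an assumption.
class DeceptionType(Enum):
    OMISSION = "omission"
    DISTORTION = "distortion"
    FABRICATION = "fabrication"
    MISDIRECTION = "misdirection"

@dataclass
class StatementRecord:
    """One statement as a distinct unit of analysis (illustrative schema)."""
    round_idx: int
    speaker: str
    text: str
    self_honest: bool                      # speaker's self-assessment of honesty
    peer_deception_votes: dict[str, bool]  # peer name -> "this statement was deceptive"
    label: DeceptionType | None = None     # None for honest statements

def update_suspicion(prev: float, signal: float, alpha: float = 0.3) -> float:
    """Exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
    Blends the newest round's judgment with the running suspicion score."""
    return alpha * signal + (1.0 - alpha) * prev

# Example: suspicion toward one player, starting from a neutral prior,
# where each round's signal is the fraction of peers flagging the player.
suspicion = 0.5
for round_signal in [0.6, 0.7, 0.8]:
    suspicion = update_suspicion(suspicion, round_signal)
print(f"smoothed suspicion: {suspicion:.3f}")  # -> 0.647
```

Under this scheme, a player who is repeatedly flagged sees suspicion rise gradually rather than jump with each vote, which matches the reported dynamic of suspicion toward Werewolves climbing from roughly 52\% to over 60\% across rounds while suspicion toward truthful roles stays flat.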
Submission Number: 217