Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies

ACL ARR 2026 January Submission 4281 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Social Deduction
Abstract: Social deduction games like Werewolf combine language, reasoning, and strategy, providing a testbed for studying language and social intelligence. However, most studies reduce the game to LLM-based self-play, yielding templated utterances and anecdotal cases that overlook the richness of social gameplay. Evaluation further relies on coarse metrics such as survival time or subjective scoring, owing to the lack of quality reference data. To address these gaps, we curate a high-quality, human-verified, multilingual, and multimodal Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens, and 15 rule variants. Based on this dataset, we propose a novel strategy-alignment evaluation that leverages the winning faction's strategies as ground truth in two stages: (1) speech evaluation, formulated as multiple-choice-style tasks that assess whether the model can adopt appropriate stances across five dimensions of social ability; and (2) decision evaluation, which assesses the model's voting choices and opponent-role inferences. This framework enables fine-grained evaluation of models' linguistic and reasoning capabilities while capturing their ability to generate strategically coherent gameplay. Our experiments show that state-of-the-art LLMs and VLMs exhibit diverse performance, with roughly half of the models in both modalities remaining below 0.50 accuracy, revealing clear gaps in deception and counterfactual reasoning.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Resources and Evaluation, Computational Social Science, Cultural Analytics, and NLP for Social Good
Contribution Types: Data resources
Languages Studied: English, Chinese
Submission Number: 4281