Keywords: Cooperative Game, Multi-turn RL, Dataset
Abstract: Cooperative reasoning under incomplete information is a significant challenge for both humans and multi-agent AI. The card game Hanabi embodies this challenge, demanding theory-of-mind reasoning and strategic communication. We present the largest evaluation to date of Large Language Models (LLMs) as Hanabi-playing agents, assessing 17 state-of-the-art LLMs in 2- to 5-player cooperative multi-agent settings. Agents were evaluated with two prompts: a minimal “MinCon” prompt and a context-rich “DeductCon” prompt that scaffolds reasoning with explicit card deductions motivated by Bayesian inference and with strategic guidance; the two prompts induced fundamentally different gameplay strategies. With the DeductCon prompt, the strongest reasoning models exceed 15 points out of 25 on average across all player counts, yet they still trail experienced human players and purpose-built RL agents, both of which consistently score above 20. We perform systematic ablations with context engineering, Best-of-K sampling, and multi-agent scaffolding to reveal when context helps, when sampling hurts, and why multi-agent coordination failures persist. To encourage further research on multi-agent Hanabi play, we release two resources: (1) 1,520 full game logs for instruction tuning and (2) 560 games with dense move-level value annotations (rewards) for all candidate moves, enabling Reinforcement Learning from AI Feedback (RLAIF) in cooperative settings.
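To illustrate the kind of card deduction the DeductCon prompt is meant to scaffold, the following is a minimal sketch, not the paper's actual prompt or implementation; the function name, interface, and example values are assumptions. It computes a Bayesian posterior over one hidden card's identity by combining hint constraints with the counts of cards still unaccounted for among publicly visible cards.

```python
from collections import Counter
from itertools import product

# Standard Hanabi deck composition: per color, three 1s, two each of 2-4, one 5.
COLORS = ["red", "yellow", "green", "blue", "white"]
RANK_COUNTS = {1: 3, 2: 2, 3: 2, 4: 2, 5: 1}

def card_posterior(hint_colors, hint_ranks, visible_cards):
    """Posterior over a single hidden card's identity (illustrative sketch).

    hint_colors / hint_ranks: sets of colors / ranks still consistent with the
    hints received for this card; None means unconstrained.
    visible_cards: (color, rank) pairs the agent can see elsewhere (other hands,
    discard pile, fireworks), which are subtracted from the prior counts.
    """
    # Prior: multiset of all cards in the deck.
    remaining = Counter()
    for color, rank in product(COLORS, RANK_COUNTS):
        remaining[(color, rank)] = RANK_COUNTS[rank]

    # Remove copies the agent can already see.
    for card in visible_cards:
        if remaining[card] > 0:
            remaining[card] -= 1

    # Keep only identities consistent with the hints, then normalize.
    consistent = {}
    for (color, rank), count in remaining.items():
        if hint_colors is not None and color not in hint_colors:
            continue
        if hint_ranks is not None and rank not in hint_ranks:
            continue
        if count > 0:
            consistent[(color, rank)] = count
    total = sum(consistent.values())
    return {card: count / total for card, count in consistent.items()}

# Example: a card hinted "red", with one red 1 already visible on the table.
probs = card_posterior({"red"}, None, [("red", 1)])
print(max(probs, key=probs.get), round(max(probs.values()), 3))
```

Posteriors of this form can be injected into the prompt as explicit per-card deductions, which is the sense in which the DeductCon context is "motivated by Bayesian inference."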
Submission Number: 180