Keywords: Cooperative Games, Multi-Turn RL, Reinforcement Learning from Verifiable Rewards
Abstract: Cooperative reasoning under incomplete information is a significant challenge for both humans and multi-agent systems. The card game Hanabi embodies this challenge, demanding theory-of-mind reasoning and strategic communication. We present the largest evaluation to date of Large Language Models (LLMs) as Hanabi-playing agents, assessing 17 state-of-the-art LLMs in 2- to 5-player cooperative multi-agent settings. We investigate why multi-agent coordination failures persist by systematically evaluating the impact of context engineering, from simple game-state tracking (Watson) to scaffolding reasoning with explicit card deductions motivated by Bayesian inference (Sherlock), across a wide range of LLM capabilities (from 4B to 600B+ parameters). To our knowledge, we show for the first time that 1) agents can maintain a working memory to track game state (Mycroft) rather than being explicitly provided engine deductions, and 2) cross-play performance interpolates smoothly between different LLMs. In the Sherlock setting, the strongest reasoning models exceed 15 points out of 25 on average across all player counts, yet they still trail experienced human players and specialist Hanabi agents, both of which consistently score above 20. Lastly, we release the first public Hanabi datasets with move utilities and annotated game trajectories: 1) HanabiLogs, 1,520 full game logs for instruction tuning, and 2) HanabiRewards, 560 games with dense move-level value annotations (rewards) for all candidate moves. We demonstrate that supervised and RL finetuning of a 4B open-weight model (Qwen3-Instruct) on our datasets improves cooperative play on Hanabi by 21% and 156%, respectively, placing it within ~3 points of a strong proprietary reasoning model (o4-mini) and surpassing the best non-reasoning model (GPT-4.1) by 52%. The HanabiRewards RL-finetuned model generalizes beyond Hanabi, improving performance on a recent cooperative group-guessing game benchmark by 11%, temporal reasoning on EventQA by 6.4%, and instruction following on IFBench-800K by 1.7 Pass@10 points, while leaving mathematical reasoning Pass@10 on AIME 2025 unchanged.
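To make the Sherlock-style scaffolding concrete, here is a minimal, purely illustrative sketch of the kind of Bayesian card deduction the abstract refers to: counting the copies of each card still unaccounted for, filtering by positive hint information, and normalizing to a posterior over a hidden card's identity. The function and variable names are our own assumptions, not the paper's code.

```python
# Illustrative sketch (not the paper's implementation): posterior over one
# hidden card's identity in Hanabi. The standard deck has 5 colors, and per
# color the copy counts are rank 1: x3, ranks 2-4: x2, rank 5: x1.
from collections import Counter

COLORS = ["red", "yellow", "green", "blue", "white"]
RANK_COUNTS = {1: 3, 2: 2, 3: 2, 4: 2, 5: 1}

def card_posterior(visible_cards, hinted_color=None, hinted_rank=None):
    """Posterior over the identities a hidden card can still take.

    visible_cards: list of (color, rank) pairs the player can see
    (teammates' hands, discard pile, fireworks already played).
    hinted_color / hinted_rank: positive hint information about the card.
    Returns a dict mapping (color, rank) -> probability.
    """
    seen = Counter(visible_cards)
    weights = {}
    for color in COLORS:
        if hinted_color is not None and color != hinted_color:
            continue  # a color hint rules out every other color
        for rank, copies in RANK_COUNTS.items():
            if hinted_rank is not None and rank != hinted_rank:
                continue  # a rank hint rules out every other rank
            remaining = copies - seen[(color, rank)]
            if remaining > 0:
                weights[(color, rank)] = remaining
    total = sum(weights.values())
    return {card: w / total for card, w in weights.items()}

# Example: a "rank 5" hint plus three visible 5s leaves only two candidates.
visible = [("red", 5), ("green", 5), ("blue", 5)]
print(card_posterior(visible, hinted_rank=5))
# {('yellow', 5): 0.5, ('white', 5): 0.5}
```

In the Sherlock setting such deductions are supplied to the model in its prompt; the Mycroft result suggests capable models can maintain equivalent bookkeeping themselves as a working memory across turns.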
Primary Area: datasets and benchmarks
Submission Number: 21258