Keywords: Cooperative Games, Multi-Turn RL, Reinforcement Learning from Verifiable Rewards
Abstract: Cooperative reasoning under incomplete information is a significant challenge for both humans and multi-agent systems. The card game Hanabi embodies this challenge, demanding theory-of-mind reasoning and strategic communication. We present the largest evaluation to date of Large Language Models (LLMs) as Hanabi-playing agents, assessing 17 state-of-the-art LLMs in 2- to 5-player cooperative multi-agent settings. We investigate why multi-agent coordination failures persist by systematically evaluating the impact of context engineering, from simple game-state tracking (Watson) to scaffolding reasoning with explicit card deductions motivated by Bayesian inference (Sherlock), across a wide range of LLM capabilities (from 4B to 600B+ parameters). To our knowledge, we show for the first time that 1) agents can maintain a working memory to track game state (Mycroft) rather than being explicitly provided engine deductions, and 2) cross-play performance interpolates smoothly between different LLMs. In the Sherlock setting, the strongest reasoning models exceed 15 points out of 25 on average across all player counts, yet they still trail experienced human players and specialist Hanabi agents, both of which consistently score above 20. Lastly, we release the first public Hanabi datasets with move utilities and annotated game trajectories: 1) HanabiLogs, 1,520 full game logs for instruction tuning, and 2) HanabiRewards, 560 games with dense move-level value annotations (rewards) for all candidate moves. We demonstrate that supervised and RL finetuning of a 4B open-weight model (Qwen3-Instruct) on our datasets improves cooperative play on Hanabi by 21% and 156%, respectively, placing it within ~3 points of a strong proprietary reasoning model (o4-mini) and surpassing the best non-reasoning model (GPT-4.1) by 52%. The HanabiRewards RL-finetuned model generalizes beyond Hanabi, improving performance on a recent cooperative group-guessing game benchmark by 11%, temporal reasoning on EventQA by 6.4%, and instruction following on IFBench-800K by 1.7 Pass@10 points, while leaving mathematical reasoning Pass@10 on AIME 2025 unchanged.
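To make the Sherlock-style scaffolding concrete, here is a minimal, purely illustrative sketch of the kind of Bayesian card deduction the abstract refers to: counting the copies of each card still unaccounted for, filtering by positive hint information, and normalizing to a posterior over a hidden card's identity. The function and variable names are our own assumptions, not the paper's code.

```python
# Illustrative sketch (not the paper's implementation): posterior over one
# hidden card's identity in Hanabi. The standard deck has 5 colors, and per
# color the copy counts are rank 1: x3, ranks 2-4: x2, rank 5: x1.
from collections import Counter

COLORS = ["red", "yellow", "green", "blue", "white"]
RANK_COUNTS = {1: 3, 2: 2, 3: 2, 4: 2, 5: 1}

def card_posterior(visible_cards, hinted_color=None, hinted_rank=None):
    """Posterior over the identities a hidden card can still take.

    visible_cards: list of (color, rank) pairs the player can see
    (teammates' hands, discard pile, fireworks already played).
    hinted_color / hinted_rank: positive hint information about the card.
    Returns a dict mapping (color, rank) -> probability.
    """
    seen = Counter(visible_cards)
    weights = {}
    for color in COLORS:
        if hinted_color is not None and color != hinted_color:
            continue  # a color hint rules out every other color
        for rank, copies in RANK_COUNTS.items():
            if hinted_rank is not None and rank != hinted_rank:
                continue  # a rank hint rules out every other rank
            remaining = copies - seen[(color, rank)]
            if remaining > 0:
                weights[(color, rank)] = remaining
    total = sum(weights.values())
    return {card: w / total for card, w in weights.items()}

# Example: a "rank 5" hint plus three visible 5s leaves only two candidates.
visible = [("red", 5), ("green", 5), ("blue", 5)]
print(card_posterior(visible, hinted_rank=5))
# {('yellow', 5): 0.5, ('white', 5): 0.5}
```

In the Sherlock setting such deductions are supplied to the model in its prompt; the Mycroft result suggests capable models can maintain equivalent bookkeeping themselves as a working memory across turns.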
Primary Area: datasets and benchmarks
Submission Number: 21258