Keywords: Cooperative Games, Multi-Turn RL, Reinforcement Learning from Verifiable Rewards
Abstract: Cooperative reasoning under incomplete information is a significant challenge
for both humans and multi-agent systems. The card game Hanabi embodies
this challenge, demanding theory of mind reasoning and strategic communication.
We present the largest evaluation to date of Large Language Models (LLMs)
as Hanabi-playing agents, assessing 17 state-of-the-art LLMs in 2- to 5-player
cooperative multi-agent settings. We investigate why multi-agent coordination
failures persist by systematically evaluating the impact of context engineering, from
simple game-state tracking (Watson) to scaffolding reasoning with explicit card
deductions motivated by Bayesian inference (Sherlock), across a wide range of
LLM capabilities (from 4B to 600B+ parameters). To our knowledge, we show for
the first time that 1) agents can maintain a working memory to track game state
instead of being explicitly provided with engine deductions, and 2) cross-play
performance interpolates smoothly between different LLMs. In the Sherlock setting,
the strongest reasoning models exceed 15 points out of 25 on average across all
player counts, yet they still trail experienced human players and specialist Hanabi
agents, both of which consistently score above 20. Lastly, we release the first
public Hanabi datasets with move utilities and annotated game trajectories: 1)
HanabiLogs: 1,520 full game logs for instruction tuning, and 2) HanabiRewards:
560 games with dense move-level value annotations (rewards) for all candidate
moves. Via instruction tuning on HanabiLogs, we show a 24% average score
improvement with Qwen3-4B-Instruct in the Sherlock setting, outperforming
powerful closed-source LLMs like GPT-4o, Claude 3.7 Sonnet, and Grok-3.
Primary Area: datasets and benchmarks
Submission Number: 21258