LLMs as Rules Oracles: Exploring Real-World Multimodal Reasoning in Tabletop Strategy Game Environments
Keywords: game reasoning, multimodal qa, vision-language grounding, benchmark, situated reasoning, tabletop games, board games
Abstract: We introduce **LudoBench**, a multimodal reasoning benchmark that evaluates whether vision-enabled large language models (LLMs) can acquire, integrate, and reason over heterogeneous game knowledge in mainstream analog tabletop games. Unlike prior work that emphasizes deep strategic mastery, LudoBench targets an initial reasoning challenge that uninitiated gamers face: *correctly comprehending a new tabletop strategy game for the first time*.
We examine whether, given a visual depiction of a tabletop scene and a corresponding ruleset, a model can correctly answer grounded questions about the pictured scenario. Concretely, LudoBench tests three cumulative situated game-comprehension capabilities, (1) *Environment Perception*, (2) *Heterogeneous Rules Integration*, and (3) *Short-horizon Optimization*, which progressively stress-test the foundational reasoning required for real-world game comprehension.
Evaluating frontier LLMs on three diverse strategy games, we find that even the strongest models achieve only ~68% accuracy on simple environment perception tasks and score below 10% on situated multi-step comprehension puzzles that hobbyist gamers can routinely solve.
Our extensive failure analysis and knowledge-ablation experiments reveal that *models largely fail to comprehend rich cross-modal reference knowledge* and are consequently unable to apply this knowledge in messy, unfamiliar situated environments. Our findings highlight the many steps that remain before current methods can succeed at complex multimodal reasoning in the real world.
Primary Area: datasets and benchmarks
Submission Number: 23072