Abstract: Large language models have demonstrated remarkable few-shot performance on
many natural language understanding tasks. Despite several demonstrations of
using large language models in complex, strategic scenarios, a comprehensive
framework for evaluating agents' performance across the various types of
reasoning found in games has been lacking. To address this gap, we introduce GAMEBENCH,
a cross-domain benchmark for evaluating strategic reasoning abilities of LLM
agents. We focus on 9 different game environments, each of which covers at least
one key axis of reasoning skill identified in strategy games, and we select games for
which strategy explanations are unlikely to form a significant portion of models'
pretraining corpora. Our evaluations use GPT-3 and GPT-4 in their base form
along with two scaffolding frameworks designed to enhance strategic reasoning
ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP).
Our results show that none of the tested models match human performance, and
at worst GPT-4 performs worse than random action. CoT and RAP both improve
scores, but not to human-comparable levels. Benchmark code is available at
https://github.com/Joshuaclymer/GameBench