GAMEBOT: Gaming Arena for Model Evaluation - Battle of Tactics

24 Sept 2024 (modified: 04 Dec 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: LLM evaluation, benchmark, competitive game
TL;DR: We introduce a reliable LLM benchmark built on competitive games that evaluates intermediate reasoning steps as well as final decisions.
Abstract: Large Language Models (LLMs) are increasingly deployed in real-world applications that demand complex reasoning. To track progress, we require robust benchmarks that evaluate their capabilities beyond superficial pattern recognition. However, existing benchmarks either suffer from data contamination or lack legibility. In this paper, we introduce GAMEBOT, a novel benchmark for evaluating LLMs in competitive gaming environments that addresses these limitations. GAMEBOT decomposes complex reasoning in games into modular subproblems, targeting abilities such as rule understanding and strategy instruction following. We develop Chain-of-Thought (CoT) prompts that leverage domain knowledge to guide LLMs and automatically validate their intermediate reasoning steps against ground truth. This approach allows us to assess not only the accuracy of final decisions but also the quality of the underlying reasoning process. We benchmark 17 prominent LLMs across eight diverse games, encompassing various strategic abilities and game characteristics. GAMEBOT offers four advantages: (1) Mitigation of Data Contamination: Dynamic game states minimize overlap with pre-training data. (2) Legibility: Evaluation of intermediate reasoning steps enables fine-grained scrutiny of LLM behavior. (3) Difficulty: The games effectively differentiate top-performing models. (4) Stronger Baselines: Our curated CoT prompts establish competitive baselines for future research. We hope GAMEBOT stimulates further work that seeks a deeper understanding of LLM reasoning capabilities in strategic settings.
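As a rough illustration of the evaluation scheme the abstract describes, the sketch below shows how intermediate CoT sub-answers and a final move might each be scored against engine-computed ground truth. All names here (GameEngine methods, query interface, the tagged prompt format) are hypothetical assumptions for illustration, not the paper's actual API.

    # Hypothetical sketch of GAMEBOT-style evaluation: validate intermediate
    # CoT sub-answers against ground truth, then score the final decision.
    # `engine` and `llm` are assumed placeholders, not the paper's API.
    import re

    def evaluate_turn(engine, llm, state):
        # The CoT prompt asks for tagged sub-answers plus a final move.
        prompt = (
            f"Game state:\n{engine.render(state)}\n"
            "Think step by step. Report each subproblem as "
            "<sub name=NAME>ANSWER</sub> and your move as <move>MOVE</move>."
        )
        response = llm(prompt)

        # Parse the tagged intermediate answers and the final decision.
        subs = dict(re.findall(r"<sub name=(\w+)>(.*?)</sub>", response))
        move_match = re.search(r"<move>(.*?)</move>", response)
        move = move_match.group(1).strip() if move_match else None

        # The engine recomputes each subproblem exactly, providing ground
        # truth to check the model's reasoning, not just its final move.
        truth = engine.solve_subproblems(state)  # e.g. {"legal_moves": ...}
        sub_scores = {
            name: float(subs.get(name, "").strip() == str(answer))
            for name, answer in truth.items()
        }

        final_score = float(move in engine.legal_moves(state))
        return sub_scores, final_score

In a competitive setting, final decisions would more plausibly be scored by match outcomes against other models rather than the simple legality check above; the snippet only illustrates the idea of validating intermediate reasoning steps separately from the final decision.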
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3782