StarCraft II Arena: Evaluating LLMs in Strategic Planning, Real-Time Decision Making, and Adaptability

Wenjie Tang; Yuan Zhou; Erqiang Xu; Keyan Cheng; Minne Li; Zhiyuan Wang

StarCraft II Arena: Evaluating LLMs in Strategic Planning, Real-Time Decision Making, and Adaptability

Wenjie Tang, Yuan Zhou, Erqiang Xu, Keyan Cheng, Minne Li, Zhiyuan Wang

28 Sept 2024 (modified: 05 Feb 2025)Submitted to ICLR 2025EveryoneRevisionsBibTeXCC BY 4.0

Keywords: benchmark evaluation, large language model, LLM-based agent, strategic reasoning, real-time decision-making.

TL;DR: A benchmark for evaluating large language models in StarCraft II, focusing on strategic planning, real-time decision-making, and adaptability using fine-grained capability metrics and decision trace analysis.

Abstract: StarCraft II plays an important role in developing AI agents for real-time strategic reasoning due to its complex nature. However, people usually draw conclusions of how competent their agents are according to the level of the built-in agents in StarCraft II which they can win in terms of the final success rate. Little intermediate quantitative information is considered while human-in-the-loop analysis is time inefficient, which results in inadequate reflection of the true strategic reasoning ability. In this work, we propose StarCraft II Arena, a well-designed benchmark for evaluating the strategic planning, real-time decision-making, and adaptability capabilities of large language models (LLMs) agents. We introduce using fine-grained capability metrics, allowing for targeted capture and analysis of specific capability, and further propose a detailed decision trace to enhance the understanding of LLM behavior. We demonstrate the utility of such a benchmark by evaluating several state-of-the-art LLMs in various setups. Our results reveal distinct performances in long-term strategy development, real-time decision-making, and adapting to environmental changes. Such results show that the StarCraft II Arena offers a deeper insight into the decision-making process of LLMs and has the potential to become a challenging and comprehensive benchmark for strategic reasoning.

Primary Area: datasets and benchmarks

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 14211

Loading