everyone
since 04 Oct 2024">EveryoneRevisionsBibTeXCC BY 4.0
StarCraft II plays an important role in developing AI agents for real-time strategic reasoning due to its complex nature. However, people usually draw conclusions of how competent their agents are according to the level of the built-in agents in StarCraft II which they can win in terms of the final success rate. Little intermediate quantitative information is considered while human-in-the-loop analysis is time inefficient, which results in inadequate reflection of the true strategic reasoning ability. In this work, we propose StarCraft II Arena, a well-designed benchmark for evaluating the strategic planning, real-time decision-making, and adaptability capabilities of large language models (LLMs) agents. We introduce using fine-grained capability metrics, allowing for targeted capture and analysis of specific capability, and further propose a detailed decision trace to enhance the understanding of LLM behavior. We demonstrate the utility of such a benchmark by evaluating several state-of-the-art LLMs in various setups. Our results reveal distinct performances in long-term strategy development, real-time decision-making, and adapting to environmental changes. Such results show that the StarCraft II Arena offers a deeper insight into the decision-making process of LLMs and has the potential to become a challenging and comprehensive benchmark for strategic reasoning.