Abstract: Developing AI agents capable of interacting with open-world environments to solve diverse tasks is a compelling challenge. However, evaluating such open-ended agents remains difficult, as current benchmarks face scalability limitations. To address this, we introduce \textit{Minecraft Universe} (MCU), a comprehensive evaluation framework set within the open-world video game Minecraft. MCU incorporates three key components: (1) an expanding collection of 3,452 composable atomic tasks spanning 11 major categories and 41 subcategories of challenges; (2) a task composition mechanism capable of generating an unlimited number of diverse tasks with varying difficulty; and (3) a general evaluation framework that achieves 91.5\% alignment with human ratings on open-ended task assessment. Empirical results reveal that even state-of-the-art foundation agents struggle with the increasing diversity and complexity of tasks. These findings highlight the necessity of MCU as a robust benchmark for driving progress in AI agent development within open-ended environments. Our evaluation code and scripts are available at https://github.com/CraftJarvis/MCU.
Lay Summary: Developing AI agents that can handle open-ended tasks in dynamic environments, like video games, is a major challenge in artificial intelligence. However, evaluating these agents is difficult due to the lack of scalable and diverse benchmarks. To address this, we introduce *Minecraft Universe* (MCU), a comprehensive evaluation framework set in the popular game Minecraft.
MCU includes thousands of customizable tasks, ranging from simple actions like mining resources to complex challenges like building structures or crafting items. It also features an automated evaluation system that aligns closely with human judgments, making assessment both efficient and reliable. Our experiments show that even the most advanced AI agents struggle with the diversity and complexity of these tasks, highlighting the need for robust benchmarks like MCU to drive progress in AI development.
By providing a standardized testing ground, MCU helps researchers create more adaptable and intelligent agents, bringing us closer to AI that can navigate real-world unpredictability.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/CraftJarvis/MCU
Primary Area: General Machine Learning->Evaluation
Keywords: benchmark, automatic evaluation, agent
Submission Number: 8831