BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks
Keywords: learning from human feedback, minecraft, human evaluation, embodied agents, rlhf, demonstrations, benchmark, evaluations, multi-modal
TL;DR: To facilitate algorithm development for the BASALT benchmark, we provide a large-scale dataset of human demonstrations and evaluations, along with a streamlined codebase for training, evaluating, and analyzing algorithms.
Abstract: The MineRL BASALT competition has catalyzed advances in learning from human feedback through four hard-to-specify tasks in Minecraft, such as creating and photographing a waterfall. Building on two successful years of competitions, we introduce the BASALT Evaluation and Demonstrations Dataset (BEDD), a resource for algorithm development and performance assessment. BEDD contains 26 million image-action pairs from nearly 14,000 videos of human players completing the BASALT tasks. It also includes over 3,000 dense pairwise human evaluations comparing both human and algorithmic agents, complete with natural-language justifications for the preference judgments. Collectively, these components are designed to support the development and evaluation of multi-modal AI systems in the context of Minecraft. The code and data are available at: https://github.com/minerllabs/basalt-benchmark.
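For orientation, the sketch below shows one way the image-action pairs might be consumed for imitation learning. It assumes each episode ships as an MP4 video plus a JSONL action log with one JSON object per frame, as in prior MineRL BASALT data releases; the file names and exact format here are illustrative assumptions, not the repository's confirmed layout.

```python
import json
import cv2  # pip install opencv-python

def load_episode(video_path: str, actions_path: str):
    """Yield (frame, action) pairs for one demonstration episode.

    Assumes a hypothetical layout: an MP4 of the player's view and a
    JSONL file with one action dict per video frame.
    """
    with open(actions_path) as f:
        actions = [json.loads(line) for line in f]

    cap = cv2.VideoCapture(video_path)
    try:
        for action in actions:
            ok, frame_bgr = cap.read()
            if not ok:  # action log outlasted the video; stop cleanly
                break
            # OpenCV decodes to BGR; convert to RGB for model input.
            yield cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB), action
    finally:
        cap.release()

# Example: iterate one (hypothetical) episode for behavioral cloning.
for frame, action in load_episode("episode-0001.mp4", "episode-0001.jsonl"):
    pass  # e.g., feed the (frame, action) pair into a training batch
```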
Submission Number: 20