Avalon: A Benchmark for RL Generalization Using Procedurally Generated Worlds

Joshua Albrecht; Abraham J Fetterman; Bryden Fogelman; Ellie Kitanidis; Bartosz Wróblewski; Nicole Seo; Michael Rosenthal; Maksis Knutins; Zachary Polizzi; James B Simon; Kanjun Qiu

Published: 17 Sept 2022, Last Modified: 20 Apr 2025, NeurIPS 2022 Datasets and Benchmarks, Readers: Everyone

Keywords: reinforcement learning, benchmark, generalization, simulator, embodied agents, virtual reality

TL;DR: Avalon is a benchmark for generalization in RL where all individual tasks are constructed via finely controlled procedural generation of environments.

Abstract: Despite impressive successes, deep reinforcement learning (RL) systems still fall short of human performance on generalization to new tasks and environments that differ from their training. As a benchmark tailored for studying RL generalization, we introduce Avalon, a set of tasks in which embodied agents in highly diverse procedural 3D worlds must survive by navigating terrain, hunting or gathering food, and avoiding hazards. Avalon is unique among existing RL benchmarks in that the reward function, world dynamics, and action space are the same for every task, with tasks differentiated solely by altering the environment; its 20 tasks, ranging in complexity from eat and throw to hunt and navigate, each create worlds in which the agent must perform specific skills in order to survive. This setup enables investigations of generalization within tasks, between tasks, and to compositional tasks that require combining skills learned from previous tasks. Avalon includes a highly efficient simulator, a library of baselines, and a benchmark with scoring metrics evaluated against hundreds of hours of human performance, all of which are open-source and publicly available. We find that standard RL baselines make progress on most tasks but are still far from human performance, suggesting Avalon is challenging enough to advance the quest for generalizable RL.

URL: Links to datasets and code can be found at https://generallyintelligent.ai/avalon/

Dataset Url: Links to datasets and code can be found at https://generallyintelligent.ai/avalon/

Dataset Embargo: All data in our dataset, our modifications to the Godot game engine, and all other code, training scripts, and assets will be released on or before Dec 1st (prior to the main NeurIPS conference). We are delaying release past the submission date because a benchmark built on top of a complex simulator needs a polished, easy-to-use main version from which people can build. The later release date gives us time to incorporate feedback and do additional testing, so that we can release the most useful version of our benchmark environment.

License: Our human playback dataset is released under the CC BY-SA 4.0 license (https://creativecommons.org/licenses/by-sa/4.0/). Our modifications to the Godot game engine will be released under the MIT license (https://opensource.org/licenses/MIT). All other code, training scripts, and other assets are released under the GPLv3 license (https://opensource.org/licenses/GPL-3.0).

Author Statement: Yes

Supplementary Material: zip

Open Credentialized Access: Datasets are on Zenodo. DOI: 10.5281/zenodo.6648370

Contribution Process Agreement: Yes

In Person Attendance: Yes

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/avalon-a-benchmark-for-rl-generalization/code)
