No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery

Published: 25 Sept 2024 · Last Modified: 06 Nov 2024 · NeurIPS 2024 poster · CC BY 4.0
Keywords: MARL, UED, Robotics
TL;DR: An improved score function for unsupervised environment design in binary-outcome settings, which we use to train agents for real-world tasks, and an improved adversarial evaluation protocol that assesses policy robustness.
Abstract: Which data or environments to use for training in order to improve downstream performance is a longstanding and highly topical question in reinforcement learning. In particular, Unsupervised Environment Design (UED) methods have gained recent attention, as their adaptive curricula promise to make agents robust to in- and out-of-distribution tasks. This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics. Surprisingly, although these methods aim to maximise regret in theory, their practical approximations correlate not with regret but with success rate. As a result, a significant portion of an agent's experience comes from environments it has already mastered, contributing little or nothing to improving its abilities. Put differently, current methods fail to predict intuitive measures of *learnability*: they cannot consistently identify the scenarios an agent can sometimes solve, but not always. Based on our analysis, we develop a method that directly trains on scenarios with high learnability. This simple and intuitive approach outperforms existing UED methods in several binary-outcome environments, including the standard Minigrid domain and a novel setting closely inspired by a real-world robotics problem. We further introduce a new adversarial evaluation procedure for directly measuring robustness, closely mirroring conditional value at risk (CVaR). We open-source all our code and present visualisations of final policies here: https://github.com/amacrutherford/sampling-for-learnability.
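The abstract's notion of learnability (scenarios the agent can sometimes solve, but not always) suggests, for binary-outcome levels, a score of the form p(1 − p), where p is the agent's empirical success rate on a level: it peaks at p = 0.5 and vanishes for mastered or impossible levels. The sketch below is a minimal illustration under that assumption; the function names and the top-k selection step are illustrative, not the paper's implementation (the released code is JAX-based).

```python
import numpy as np

def learnability_score(success_rates: np.ndarray) -> np.ndarray:
    """Learnability of each level: p * (1 - p), where p is the agent's
    empirical success rate on that level. The score peaks at p = 0.5
    (sometimes solved, but not always) and is zero for levels the agent
    always solves (p = 1) or never solves (p = 0)."""
    return success_rates * (1.0 - success_rates)

def select_training_levels(success_rates: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k levels with highest learnability (hypothetical helper)."""
    return np.argsort(learnability_score(success_rates))[::-1][:k]

# Mastered (p = 1.0) and impossible (p = 0.0) levels rank last; p = 0.5 ranks first.
p = np.array([0.0, 0.25, 0.5, 0.9, 1.0])
print(select_training_levels(p, k=2))  # -> [2 1]
```

For the adversarial evaluation procedure mirroring CVaR, a simple stand-in is to report the mean return over the worst α-fraction of evaluation levels. The paper's actual protocol is adversarial and may differ in detail, so treat this as an assumed approximation rather than the authors' method:

```python
import numpy as np

def cvar_of_returns(returns: np.ndarray, alpha: float = 0.1) -> float:
    """Mean return over the worst alpha-fraction of evaluation levels,
    i.e. the conditional value at risk (CVaR) of the return distribution."""
    n_worst = max(1, int(np.ceil(alpha * len(returns))))
    return float(np.sort(returns)[:n_worst].mean())

# Example: per-level returns from an evaluation set of 10 levels.
returns = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])
print(cvar_of_returns(returns, alpha=0.3))  # mean of the 3 worst -> 0.0
```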
Primary Area: Reinforcement learning
Submission Number: 21766