Discovering Minimal Reinforcement Learning Environments

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Reinforcement Learning, Meta-Learning, Evolution Strategies, Environments
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We meta-optimize neural networks representing contextual bandits that rapidly train agents and broadly generalize.
Abstract: Human agents often acquire skills under conditions that differ significantly from the context in which the skill is needed. For example, students prepare for an exam not by taking it, but by studying books or supplementary material. Can artificial agents likewise benefit from training outside of their evaluation environment? In this project, we develop a novel meta-optimization framework to discover neural network-based synthetic environments. We find that contextual bandits suffice to train Reinforcement Learning agents that generalize well to their evaluation environment, eliminating the need to meta-learn a transition function. We show that the synthetic contextual bandits train Reinforcement Learning agents in a fraction of the time steps and wall-clock time, and generalize across hyperparameter settings and algorithms. Using our method in combination with a curriculum on the performance evaluation horizon, we achieve competitive results on a number of challenging continuous control problems. Our approach opens a multitude of new research directions: contextual bandits are easy to interpret, yielding insights into the tasks encoded by the evaluation environment. Additionally, we demonstrate that synthetic environments can be used in downstream meta-learning setups, derive a new policy from the differentiable reward function, and show that the synthetic environments generalize to entirely different optimization settings.
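To make the framework concrete, below is a minimal, self-contained sketch of the two-level loop the abstract outlines: an outer evolution-strategies search over the parameters of a synthetic contextual bandit (a small network mapping observation and action to a reward, with no transition function), where the fitness of each candidate bandit is the return an agent reaches on the real task after briefly training inside that bandit. This is not the authors' implementation; the network sizes, the inner training rule, and the toy 1-D evaluation task are all illustrative assumptions.

```python
# Sketch (assumptions throughout, not the paper's code): outer ES loop
# meta-optimizes a synthetic contextual bandit; inner loop trains an agent
# only in the bandit; fitness is the agent's return on the "real" task.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM, HIDDEN = 2, 1, 16  # illustrative sizes

def init_env_params():
    # Synthetic contextual bandit: one-hidden-layer MLP (obs, act) -> reward.
    return [rng.normal(0.0, 0.1, (OBS_DIM + ACT_DIM, HIDDEN)),
            np.zeros(HIDDEN),
            rng.normal(0.0, 0.1, HIDDEN)]

def synth_reward(env_params, obs, act):
    W1, b1, w2 = env_params
    h = np.tanh(np.concatenate([obs, act]) @ W1 + b1)
    return float(h @ w2)

def real_return(policy_w, horizon=20):
    # Toy stand-in for the evaluation environment (assumption): steer a
    # 1-D point toward the origin; the paper uses continuous-control tasks.
    x, total = np.array([1.0, 1.0]), 0.0  # second entry acts as a bias input
    for _ in range(horizon):
        a = np.tanh(x @ policy_w)          # linear-tanh policy, 1-D action
        x = x + np.array([0.1 * a, 0.0])   # hypothetical dynamics
        total -= abs(x[0])                 # reward: negative distance
    return total

def train_in_bandit(env_params, steps=200, lr=0.1, sigma=0.05):
    # Inner loop (assumption): improve a linear policy on the synthetic
    # bandit reward via a two-point finite-difference gradient estimate.
    w = np.zeros(OBS_DIM)
    for _ in range(steps):
        obs = rng.normal(size=OBS_DIM)
        eps = rng.normal(size=OBS_DIM)

        def r(wi):
            return synth_reward(env_params, obs, np.array([np.tanh(obs @ wi)]))

        w += lr * (r(w + sigma * eps) - r(w - sigma * eps)) / (2 * sigma) * eps
    return w

# Outer loop: OpenAI-style ES over the synthetic-environment parameters.
env, SIGMA, LR, POP = init_env_params(), 0.05, 0.02, 16
for gen in range(30):
    noises, fits = [], []
    for _ in range(POP):
        eps = [rng.normal(size=p.shape) for p in env]
        cand = [p + SIGMA * e for p, e in zip(env, eps)]
        fits.append(real_return(train_in_bandit(cand)))
        noises.append(eps)
    f = (np.array(fits) - np.mean(fits)) / (np.std(fits) + 1e-8)
    for i, p in enumerate(env):  # in-place ES update of each parameter array
        p += LR / (POP * SIGMA) * sum(fi * n[i] for fi, n in zip(f, noises))
    print(f"gen {gen:02d}  mean real return {np.mean(fits):7.3f}")
```

The curriculum on the performance evaluation horizon mentioned in the abstract would correspond here to gradually increasing the `horizon` argument of `real_return` across generations, so early fitness evaluations are short and cheap.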
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7108