First-Explore, then Exploit: Meta-Learning to Solve Hard Exploration-Exploitation Trade-Offs

Published: 09 Oct 2024, Last Modified: 02 Dec 2024
Venue: NeurIPS 2024 Workshop IMOL Poster
License: CC BY 4.0
Track: Full track
Keywords: Meta-RL, Intrinsic Motivation, RL
Abstract: Standard reinforcement learning (RL) agents never explore intelligently like a human (i.e., taking into account complex domain priors and adapting quickly based on previous exploration). Across episodes, RL agents struggle to perform even simple exploration strategies, for example, systematic search that avoids exploring the same location multiple times. Meta-RL is a potential solution, as unlike standard RL, meta-RL can *learn* to explore. We identify a new challenge with meta-RL that aims to maximize the cumulative reward of an episode sequence (cumulative-reward meta-RL). When the optimal behavior is to sacrifice reward in early episodes for better exploration (and thus enable higher later-episode rewards), existing cumulative-reward meta-RL methods become stuck in the local optimum of failing to explore. We introduce a new method, First-Explore, which overcomes this limitation by learning two policies: one to solely explore, and one to solely exploit. When exploring, and thus forgoing early-episode reward, is required, First-Explore significantly outperforms existing cumulative-reward meta-RL methods. By identifying and solving the previously unrecognized problem of forgoing reward in early episodes, First-Explore represents a significant step towards developing meta-RL algorithms capable of more human-like exploration on a broader range of domains. In complex or open-ended environments, this approach could allow the agent to develop sophisticated exploration heuristics that mimic intrinsic motivations (e.g., prioritizing seeking novel observations).
Submission Number: 19
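To make the two-policy idea in the abstract concrete, below is a minimal, hypothetical sketch of the rollout structure it describes: a pure-exploration policy gathers information for several early episodes (forgoing reward), then a separate pure-exploitation policy acts on the accumulated context. The `Env`, `explore_policy`, `exploit_policy`, and `first_explore_rollout` names, the bandit-style environment, and the hand-coded stand-in policies are illustrative assumptions, not the paper's actual learned meta-RL implementation.

```python
# Hypothetical sketch of a First-Explore-style rollout: k_explore episodes of
# pure exploration followed by k_exploit episodes of pure exploitation.
# Interfaces and policies here are illustrative stand-ins, not the paper's method.
import random
from typing import List, Tuple


class Env:
    """Toy bandit-like environment: each arm has a hidden mean reward."""

    def __init__(self, n_arms: int = 10, seed: int = 0):
        rng = random.Random(seed)
        self.means = [rng.random() for _ in range(n_arms)]
        self.n_arms = n_arms

    def step(self, arm: int) -> float:
        # Noisy reward around the chosen arm's hidden mean.
        return self.means[arm] + random.gauss(0.0, 0.1)


def explore_policy(context: List[Tuple[int, float]], n_arms: int) -> int:
    """Stand-in exploration policy: systematically try arms not yet seen."""
    tried = {arm for arm, _ in context}
    untried = [a for a in range(n_arms) if a not in tried]
    return untried[0] if untried else random.randrange(n_arms)


def exploit_policy(context: List[Tuple[int, float]], n_arms: int) -> int:
    """Stand-in exploitation policy: greedily pick the best arm observed so far."""
    if not context:
        return random.randrange(n_arms)
    return max(context, key=lambda x: x[1])[0]


def first_explore_rollout(env: Env, k_explore: int, k_exploit: int) -> float:
    """Explore for k_explore episodes, then exploit for k_exploit episodes."""
    context: List[Tuple[int, float]] = []
    total_exploit_reward = 0.0
    for _ in range(k_explore):  # early episodes forgo reward to gather information
        arm = explore_policy(context, env.n_arms)
        context.append((arm, env.step(arm)))
    for _ in range(k_exploit):  # later episodes exploit the gathered context
        arm = exploit_policy(context, env.n_arms)
        total_exploit_reward += env.step(arm)
    return total_exploit_reward


if __name__ == "__main__":
    env = Env(n_arms=10)
    print("exploit-phase return:", first_explore_rollout(env, k_explore=10, k_exploit=5))
```

In the paper's setting both policies would be learned meta-RL policies conditioned on the cross-episode context; the sketch only illustrates why separating exploration from exploitation avoids the local optimum of never forgoing early-episode reward.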