Abstract: Unsupervised pretraining has been transformative in many supervised domains. However, applying such ideas to reinforcement learning (RL) presents a unique challenge in that fine-tuning does not involve mimicking task-specific data, but rather exploring and locating the solution through iterative self-improvement. In this work, we study how unlabeled offline trajectory data can be leveraged to learn efficient exploration strategies. While prior data can be used to pretrain a set of low-level skills, or as additional off-policy data for online RL, it has been unclear how to combine these ideas effectively for online exploration. Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits. Our method first extracts low-level skills using a variational autoencoder (VAE), and then pseudo-labels unlabeled trajectories with optimistic rewards and high-level action labels, transforming prior data into high-level, task-relevant examples that encourage novelty-seeking behavior. Finally, SUPE uses these transformed examples as additional off-policy data for online RL to learn a high-level policy that composes pretrained low-level skills to explore efficiently. In our experiments, SUPE consistently outperforms prior strategies across a suite of 42 long-horizon, sparse-reward tasks.
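Below is a minimal sketch (not the authors' implementation) of the pseudo-labeling step the abstract describes, under assumed interfaces: `encode_segment` stands in for the pretrained skill VAE encoder and `novelty_bonus` for an optimistic reward label (e.g., a novelty bonus combined with a reward estimate). It shows how unlabeled trajectories become high-level, task-relevant off-policy transitions.

```python
import numpy as np

def encode_segment(obs_segment, act_segment):
    """Hypothetical stand-in for the pretrained VAE skill encoder:
    maps a trajectory segment to a latent skill vector z."""
    return np.tanh(np.concatenate([obs_segment.mean(0), act_segment.mean(0)]))

def novelty_bonus(obs):
    """Hypothetical stand-in for an optimistic reward label
    (e.g., an exploration bonus plus a learned reward estimate)."""
    return float(np.linalg.norm(obs))

def pseudo_label(trajectory, skill_len=10):
    """Turn one unlabeled trajectory into high-level off-policy transitions:
    (state, latent skill, optimistic reward, next state, done)."""
    obs, acts = trajectory["observations"], trajectory["actions"]
    transitions = []
    for t in range(0, len(obs) - skill_len, skill_len):
        z = encode_segment(obs[t:t + skill_len], acts[t:t + skill_len])
        r = novelty_bonus(obs[t + skill_len])  # optimistic reward label
        transitions.append((obs[t], z, r, obs[t + skill_len], False))
    return transitions

# Usage: label a random prior trajectory and add it to the high-level replay buffer.
traj = {"observations": np.random.randn(100, 4), "actions": np.random.randn(100, 2)}
high_level_buffer = pseudo_label(traj)
```

These relabeled transitions are then mixed with online experience as off-policy data when training the high-level policy.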
Lay Summary: How can we leverage unlabeled prior data to improve the online learning and exploration of a reinforcement learning (RL) agent (a self-improving agent) on challenging tasks that require long-horizon reasoning? Our paper presents a method that uses these data effectively in two steps: (1) break the data into segments and distill them into a set of low-level skills that imitate those segments, and (2) determine which skills are most appropriate to use by analyzing the high-level structure of the data. These steps let us learn a high-level agent that picks low-level skills at a fixed time interval during online learning (see the sketch below). By using the high-level agent to carefully select low-level skills online, we collect data in a structured manner, improving sample efficiency. As a result, our method learns efficiently from limited online data and achieves strong performance on a suite of 42 simulated robotic tasks, outperforming all prior strategies.
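The sketch below illustrates the hierarchical control loop described above, under assumed, hypothetical interfaces (`high_level_policy`, `low_level_skill`, and a toy `env_step` are placeholders, not the released code): every fixed interval the high-level agent picks a latent skill, and the pretrained low-level skill executes primitive actions until the next decision.

```python
import numpy as np

SKILL_LEN, Z_DIM, OBS_DIM, ACT_DIM = 10, 4, 4, 2

def high_level_policy(obs):
    """Hypothetical high-level policy: returns a latent skill vector."""
    return np.random.uniform(-1.0, 1.0, size=Z_DIM)

def low_level_skill(obs, z):
    """Hypothetical pretrained skill decoder: maps (obs, z) to an action."""
    return np.tanh(obs[:ACT_DIM] + z[:ACT_DIM])

def env_step(obs, action):
    """Toy placeholder environment transition."""
    return obs + 0.1 * np.pad(action, (0, OBS_DIM - ACT_DIM)), 0.0, False

obs = np.zeros(OBS_DIM)
for step in range(100):
    if step % SKILL_LEN == 0:          # fixed-interval high-level decision
        z = high_level_policy(obs)
    action = low_level_skill(obs, z)   # chosen skill runs until the next decision
    obs, reward, done = env_step(obs, action)
```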
Link To Code: https://github.com/rail-berkeley/SUPE
Primary Area: Reinforcement Learning->Online
Keywords: Offline-to-online RL, Unsupervised Pre-training, Exploration
Submission Number: 8279