Abstract: Unsupervised pretraining has been transformative in many supervised domains. However, applying such ideas to reinforcement learning (RL) presents a unique challenge in that fine-tuning does not involve mimicking task-specific data, but rather exploring and locating the solution through iterative self-improvement. In this work, we study how unlabeled offline trajectory data can be leveraged to learn efficient exploration strategies. While prior data can be used to pretrain a set of low-level skills, or as additional off-policy data for online RL, it has been unclear how to combine these ideas effectively for online exploration. Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits. Our method first extracts low-level skills using a variational autoencoder (VAE), and then pseudo-labels unlabeled trajectories with optimistic rewards and high-level action labels, transforming prior data into high-level, task-relevant examples that encourage novelty-seeking behavior. Finally, SUPE uses these transformed examples as additional off-policy data for online RL to learn a high-level policy that composes pretrained low-level skills to explore efficiently. In our experiments, SUPE consistently outperforms prior strategies across a suite of 42 long-horizon, sparse-reward tasks.
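Below is a minimal sketch (not the authors' implementation) of the pseudo-labeling step the abstract describes, under assumed interfaces: `encode_segment` stands in for the pretrained skill VAE encoder and `novelty_bonus` for an optimistic reward label (e.g., a novelty bonus combined with a reward estimate). It shows how unlabeled trajectories become high-level, task-relevant off-policy transitions.

```python
import numpy as np

def encode_segment(obs_segment, act_segment):
    """Hypothetical stand-in for the pretrained VAE skill encoder:
    maps a trajectory segment to a latent skill vector z."""
    return np.tanh(np.concatenate([obs_segment.mean(0), act_segment.mean(0)]))

def novelty_bonus(obs):
    """Hypothetical stand-in for an optimistic reward label
    (e.g., an exploration bonus plus a learned reward estimate)."""
    return float(np.linalg.norm(obs))

def pseudo_label(trajectory, skill_len=10):
    """Turn one unlabeled trajectory into high-level off-policy transitions:
    (state, latent skill, optimistic reward, next state, done)."""
    obs, acts = trajectory["observations"], trajectory["actions"]
    transitions = []
    for t in range(0, len(obs) - skill_len, skill_len):
        z = encode_segment(obs[t:t + skill_len], acts[t:t + skill_len])
        r = novelty_bonus(obs[t + skill_len])  # optimistic reward label
        transitions.append((obs[t], z, r, obs[t + skill_len], False))
    return transitions

# Usage: label a random prior trajectory and add it to the high-level replay buffer.
traj = {"observations": np.random.randn(100, 4), "actions": np.random.randn(100, 2)}
high_level_buffer = pseudo_label(traj)
```

These relabeled transitions are then mixed with online experience as off-policy data when training the high-level policy.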
Lay Summary: How can we leverage unlabeled prior data to improve the online learning and exploration of a reinforcement learning (RL) agent (a self-improving agent) on challenging tasks that require long-horizon reasoning? Our paper presents a method that uses these data effectively in two steps: (1) break the data into segments and distill them into a set of low-level skills that imitate those segments, and (2) determine which skills are most appropriate to use by analyzing the high-level structure of the data. These steps let us learn a high-level agent that picks low-level skills at a fixed time interval during online learning (see the sketch below). By using the high-level agent to carefully select low-level skills online, we collect data in a structured manner, improving sample efficiency. As a result, our method learns efficiently from limited online data and achieves strong performance on a suite of 42 simulated robotic tasks, outperforming all prior strategies.
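The sketch below illustrates the hierarchical control loop described above, under assumed, hypothetical interfaces (`high_level_policy`, `low_level_skill`, and a toy `env_step` are placeholders, not the released code): every fixed interval the high-level agent picks a latent skill, and the pretrained low-level skill executes primitive actions until the next decision.

```python
import numpy as np

SKILL_LEN, Z_DIM, OBS_DIM, ACT_DIM = 10, 4, 4, 2

def high_level_policy(obs):
    """Hypothetical high-level policy: returns a latent skill vector."""
    return np.random.uniform(-1.0, 1.0, size=Z_DIM)

def low_level_skill(obs, z):
    """Hypothetical pretrained skill decoder: maps (obs, z) to an action."""
    return np.tanh(obs[:ACT_DIM] + z[:ACT_DIM])

def env_step(obs, action):
    """Toy placeholder environment transition."""
    return obs + 0.1 * np.pad(action, (0, OBS_DIM - ACT_DIM)), 0.0, False

obs = np.zeros(OBS_DIM)
for step in range(100):
    if step % SKILL_LEN == 0:          # fixed-interval high-level decision
        z = high_level_policy(obs)
    action = low_level_skill(obs, z)   # chosen skill runs until the next decision
    obs, reward, done = env_step(obs, action)
```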
Link To Code: https://github.com/rail-berkeley/SUPE
Primary Area: Reinforcement Learning->Online
Keywords: Offline-to-online RL, Unsupervised Pre-training, Exploration
Submission Number: 8279