Stress-Testing Offline Reward-Free Reinforcement Learning: A Case for Planning with Latent Dynamics Models

Published: 28 Feb 2025, Last Modified: 02 Mar 2025
WRL@ICLR 2025 Poster
License: CC BY 4.0
Track: full paper
Keywords: Offline RL, reward-free RL, goal-conditioned RL, zero-shot RL, representation learning, dynamics learning
TL;DR: We stress-test offline RL methods for reward-free data to find which methods generalize best from suboptimal data.
Abstract: Reinforcement learning (RL) has enabled significant progress in controlling embodied agents. While online RL can learn complex behaviors, it is usually costly and limiting because it requires direct interaction between the agent and its environment. Offline RL, in contrast, promises to solve tasks from pre-collected data without any direct environment interaction. In particular, zero-shot and goal-conditioned offline RL methods can even handle reward-free data. However, it remains unclear how the properties of the offline dataset influence the performance of offline RL for reward-free data. In this work, we study how well offline RL methods for reward-free data generalize, using controlled offline datasets of varying quality. We find that model-free approaches excel when given a large amount of high-quality data, but that model-based planning achieves superior performance when there is variability in the environment layouts, when solving the task requires stitching suboptimal trajectories, or when the dataset is small. Given the scarcity of high-quality, task-specific data and the abundance of suboptimal, task-agnostic trajectories in real-world scenarios, our results suggest that planning with a dynamics model is an appealing choice for zero-shot generalization from suboptimal data.
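To make the abstract's core idea concrete: "planning with a latent dynamics model" means selecting actions by rolling candidate action sequences through a learned forward model in latent space, rather than querying a learned policy. The following is a minimal sketch, not the paper's implementation; `encode` and `dynamics` are hypothetical stand-ins for trained networks, and the planner is plain random-shooting MPC toward a goal embedding.

    # Minimal illustrative sketch (assumed, not from the paper): zero-shot
    # goal reaching by planning through a learned latent dynamics model.
    import numpy as np

    def encode(obs):
        # Hypothetical encoder: maps an observation to a latent state.
        return np.tanh(obs)

    def dynamics(z, a):
        # Hypothetical latent forward model: predicts the next latent state.
        return np.tanh(z + 0.1 * a)

    def plan(obs, goal_obs, horizon=10, n_samples=256, action_dim=2, seed=0):
        """Random-shooting MPC: sample action sequences, roll them out in
        latent space, and return the first action of the best sequence."""
        rng = np.random.default_rng(seed)
        z, z_goal = encode(obs), encode(goal_obs)
        actions = rng.uniform(-1, 1, size=(n_samples, horizon, action_dim))
        z = np.repeat(z[None], n_samples, axis=0)
        for t in range(horizon):
            z = dynamics(z, actions[:, t])
        # Cost: distance between the final predicted latent and the goal latent.
        costs = np.linalg.norm(z - z_goal[None], axis=-1)
        return actions[np.argmin(costs), 0]

    first_action = plan(np.zeros(2), np.ones(2))

Because such a planner optimizes at decision time against the goal, it can in principle stitch together behavior not present in any single training trajectory, which is one intuition behind the abstract's finding that model-based planning copes better with suboptimal or scarce data.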
Presenter: ~Vlad_Sobal1
Format: Yes, the presenting author will definitely attend in person because they are attending ICLR for other complementary reasons.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 37