Learning from the Right Mistakes: When Do Low-Performing Data Help Offline Policy Gradients?

Jesse Silverberg; Glen Berseth; Marc G Bellemare

Published: 25 May 2026, Last Modified: 27 May 2026 | DEMO 2026 Poster | CC BY 4.0

Keywords: reinforcement learning, offline reinforcement learning, policy gradient, importance sampling

TL;DR: The structure of low-performing data affects how much offline RL algorithms can learn from them

Abstract: Imitation learning approaches currently reign supreme for robotic manipulation, as value-based offline reinforcement learning (RL) approaches have not yet proven successful at scaling to state-of-the-art large models. As a result, most robotics training pipelines will filter out suboptimal data or include them in supervised training objectives. Motivated by recent successes in post-training for large language models, we investigate the viability of policy gradient algorithms with importance sampling as a method of learning from subpar data. We show that with certain dataset compositions common in practical settings, such algorithms can extract useful additional learning signal from low-performing data. We also find that such approaches are not able to extract useful signal from low-performing data within datasets that are formed from the replay buffers of agents trained with RL, a dataset composition that is prevalent in the offline RL literature but rare in the real world. Our results point to the importance of considering the interplay between dataset composition and offline RL algorithm design.

Submission Number: 125
