Task-Agnostic Pre-training and Task-Guided Fine-tuning for Versatile Diffusion Planner

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A paradigm for learning a versatile diffusion planner from sub-optimal transitions
Abstract: Diffusion models have demonstrated their capability to model multi-task trajectories. However, existing multi-task planners or policies typically rely on task-specific demonstrations via multi-task imitation, or require task-specific reward labels to facilitate policy optimization via Reinforcement Learning (RL). Both are costly due to the substantial human effort required to collect expert data or design reward functions. To address these challenges, we aim to develop a versatile diffusion planner that can leverage large-scale, low-quality data containing task-agnostic sub-optimal trajectories, while retaining the ability to adapt quickly to specific tasks. In this paper, we propose SODP, a two-stage framework that leverages Sub-Optimal data to learn a Diffusion Planner that generalizes to various downstream tasks. Specifically, in the pre-training stage, we train a foundation diffusion planner that extracts general planning capabilities by modeling the versatile distribution of multi-task trajectories, which may be sub-optimal but offer wide data coverage. For downstream tasks, we then adopt RL-based fine-tuning with task-specific rewards to quickly refine the diffusion planner toward generating action sequences with higher task-specific returns. Experimental results on multi-task domains, including Meta-World and Adroit, demonstrate that SODP outperforms state-of-the-art methods with only a small amount of data for reward-guided fine-tuning.
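
As a rough illustration of the two-stage paradigm described above (not the authors' implementation), the sketch below pre-trains a small DDPM-style denoiser on flattened action sequences and then fine-tunes it with a return-weighted denoising loss as a simple stand-in for the paper's RL-based fine-tuning; the denoiser architecture, noise schedule, and weighting scheme are all assumptions.

import torch
import torch.nn as nn

HORIZON, ACT_DIM, T = 8, 4, 100            # planning horizon, action dim, diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class Denoiser(nn.Module):
    """Predicts the noise added to a flattened action sequence."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(HORIZON * ACT_DIM + 1, 256), nn.ReLU(),
            nn.Linear(256, HORIZON * ACT_DIM),
        )

    def forward(self, x, t):
        t_emb = t.float().unsqueeze(-1) / T  # crude timestep embedding
        return self.net(torch.cat([x, t_emb], dim=-1))

def denoising_loss(model, x0, weights=None):
    """Standard DDPM noise-prediction loss, optionally weighted per sample."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    err = ((model(x_t, t) - noise) ** 2).mean(dim=-1)
    if weights is not None:
        err = err * weights
    return err.mean()

model = Denoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Stage 1: task-agnostic pre-training on (possibly sub-optimal) multi-task trajectories.
pretrain_batch = torch.randn(64, HORIZON * ACT_DIM)   # placeholder for real trajectory data
loss = denoising_loss(model, pretrain_batch)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: task-guided fine-tuning; exponentiated task returns weight the loss
# (reward-weighted regression) as a simple proxy for the paper's RL fine-tuning.
finetune_batch = torch.randn(64, HORIZON * ACT_DIM)   # placeholder for task-specific rollouts
returns = torch.randn(64)                              # placeholder task-specific returns
weights = torch.softmax(returns, dim=0) * returns.shape[0]
loss = denoising_loss(model, finetune_batch, weights)
opt.zero_grad(); loss.backward(); opt.step()

In this sketch the fine-tuning stage reuses the pre-trained weights and only changes the objective, reflecting the idea that the pre-trained planner supplies a broad behavioral prior that task rewards then sharpen.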
Lay Summary: Training agents to perform multiple tasks automatically is challenging due to the high cost of collecting human demonstrations for each task or labeling existing data to distinguish good from bad behaviors. We investigate whether a generalizable agent can be trained on low-quality trajectories without explicit labels, while still retaining the ability to adapt quickly to various downstream tasks. To this end, we develop a versatile agent based on a diffusion model, trained with a proposed two-stage paradigm. We show that training on low-quality trajectories in the first stage provides a broad prior over behavior patterns, even in the absence of labels, and facilitates subsequent adaptation to specific tasks in the second stage.
Primary Area: Reinforcement Learning
Keywords: reinforcement learning, diffusion models, planning
Submission Number: 3605