Score-Based Diffusion Policy Compatible with Reinforcement Learning via Optimal Transport

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 Poster · CC BY 4.0
Abstract: Diffusion policies have shown promise in learning complex behaviors from demonstrations, particularly for tasks requiring precise control and long-term planning. However, they face challenges in robustness when encountering distribution shifts. This paper explores improving diffusion-based imitation learning (IL) models through online interactions with the environment. We propose OTPR (Optimal Transport-guided score-based diffusion Policy for Reinforcement learning fine-tuning), a novel method that integrates diffusion policies with reinforcement learning (RL) using optimal transport theory. OTPR leverages the Q-function as a transport cost and views the policy as an optimal transport map, enabling efficient and stable fine-tuning. Moreover, we introduce masked optimal transport to guide state-action matching using expert keypoints and a compatibility-based resampling strategy to enhance training stability. Experiments on three simulation tasks demonstrate OTPR's superior performance and robustness compared to existing methods, especially in complex and sparse-reward environments. In sum, OTPR provides an effective framework for combining IL and RL, achieving versatile and reliable policy learning.
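To illustrate the core idea of using the Q-function as a transport cost, the snippet below is a minimal sketch (not the authors' implementation): it computes an entropic optimal transport plan via Sinkhorn iterations, where the cost between policy samples and expert state-action pairs is taken as the negative Q-value. The function names, the placeholder critic values, and the uniform marginals are all assumptions for illustration; the paper's actual formulation additionally involves masked optimal transport and compatibility-based resampling.

```python
import numpy as np

def sinkhorn_plan(cost, epsilon=0.1, n_iters=100):
    """Entropic OT plan between two uniform empirical distributions,
    computed with standard Sinkhorn iterations on a cost matrix."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # uniform marginals
    K = np.exp(-cost / epsilon)                        # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                 # transport plan

# Hypothetical usage: q_values[i, j] stands in for Q(s_i, a_j^expert),
# so a higher Q-value means a cheaper (more desirable) match.
q_values = np.random.randn(8, 8)   # placeholder for a learned critic
plan = sinkhorn_plan(-q_values)    # negative Q as transport cost
```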
Lay Summary: Reinforcement learning (RL) algorithms often struggle to balance exploration and efficiency when adapting to complex environments, limiting their applicability. This paper explores improving diffusion-based imitation learning models through online interactions with the environment. We propose a new method, OTPR, that combines score-based diffusion models with optimal transport theory and uses trial-and-error learning to fine-tune policies, ensuring robots dynamically adjust to surprises while avoiding wasted effort. Our method establishes a mathematical connection between trial-and-error learning and optimal transport planning, where the Q-function serves as the transport cost and the policy operates as an optimal transport map. Moreover, we introduce masked optimal transport to guide state-action matching using expert keypoints and a compatibility-based resampling strategy to enhance training stability. Tests in virtual environments show OTPR outperforms existing methods, especially in complex tasks with sparse rewards, paving the way for robots that learn reliably in dynamic real-world settings.
Primary Area: Reinforcement Learning->Deep RL
Keywords: Diffusion Policy, Reinforcement Learning, Optimal Transport
Submission Number: 15847