Modular Diffusion Policy Training: Decoupling and Recombining Guidance and Diffusion for Offline RL

TMLR Paper 6331 Authors

28 Oct 2025 (modified: 06 Nov 2025) · Under review for TMLR · CC BY 4.0
Abstract: In classifier-free diffusion (CFD), the diffusion model and its guidance are typically learned jointly and applied jointly at inference. Before the guidance has converged, it provides unstable or even misleading gradients, which makes the early stage of training inefficient and unstable. This strict coupling not only induces self-reinforcing variance and biased errors but also prevents the guidance module from being reused across different diffusion models. We propose Guidance-First Diffusion Training (GFDT), which pretrains and freezes the guidance model before diffusion policy learning. GFDT reduces peak memory and computation by 38.1%, decreases diffusion training time by 65.6% and 27.66%, and achieves up to 43.16% and 60.98% performance improvements on offline RL benchmarks. Beyond efficiency, we uncover a strong plug-and-play property: replacing the guidance module only at inference time can substantially improve stability. Cross-algorithm swaps (e.g., Implicit Diffusion Q-Learning (IDQL) guidance for Diffusion Q-Learning (DQL) policies) perform comparably to the stronger of the two algorithms, despite the modules never being co-trained. Our theoretical analysis shows that GFDT enables convergence to an optimal guidance model and proves that it accelerates training. We further prove that plug-and-play remains valid as long as the guidance and the diffusion model are trained on the same data distribution. Limitations arising from dataset mismatch are analyzed in detail, which further underscores the necessity of distributional alignment. This work opens a new line of research by treating diffusion and guidance as modular units that can be recombined, rather than as a monolithic process, suggesting a paradigm that may guide the future development of diffusion-based reinforcement learning.
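The two-stage recipe described in the abstract can be sketched in a few dozen lines. This is a minimal PyTorch illustration, not the paper's implementation: the network sizes, the synthetic data, the simplified forward/reverse diffusion process, and all names (`QGuidance`, `NoisePredictor`, `sample`) are hypothetical stand-ins. It shows the essential structure — pretrain a value-based guidance model, freeze it, train the diffusion policy against fixed guidance, and steer sampling with the frozen module, which can then be swapped plug-and-play at inference.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative dimensions (not from the paper).
STATE_DIM, ACTION_DIM, T = 4, 2, 10  # T = number of diffusion steps

class QGuidance(nn.Module):
    """Value-style guidance model, pretrained before policy learning."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64),
                                 nn.ReLU(), nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class NoisePredictor(nn.Module):
    """Diffusion-policy epsilon-network conditioned on state and timestep."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM + 1, 64),
                                 nn.ReLU(), nn.Linear(64, ACTION_DIM))
    def forward(self, s, a_t, t):
        t_feat = t.float().unsqueeze(-1) / T
        return self.net(torch.cat([s, a_t, t_feat], dim=-1))

# Stage 1: pretrain the guidance on the (here synthetic) offline dataset, then freeze it.
guidance = QGuidance()
opt_g = torch.optim.Adam(guidance.parameters(), lr=1e-3)
states = torch.randn(256, STATE_DIM)
actions = torch.randn(256, ACTION_DIM)
returns = states[:, :1] * actions[:, :1]  # synthetic regression target
for _ in range(200):
    opt_g.zero_grad()
    ((guidance(states, actions) - returns) ** 2).mean().backward()
    opt_g.step()
guidance.requires_grad_(False)  # frozen: no gradients flow into guidance afterwards

# Stage 2: train the diffusion policy; the guidance stays fixed throughout.
policy = NoisePredictor()
opt_p = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(100):
    t = torch.randint(0, T, (256,))
    noise = torch.randn_like(actions)
    a_t = actions + noise * (t.float().unsqueeze(-1) / T)  # simplified forward process
    opt_p.zero_grad()
    ((policy(states, a_t, t) - noise) ** 2).mean().backward()
    opt_p.step()

# Inference: guided denoising. Because the guidance is a separate frozen module,
# any compatible guidance model can be plugged in here without retraining the policy.
def sample(policy, guidance, s, guidance_scale=1.0):
    a = torch.randn(s.shape[0], ACTION_DIM)
    for step in reversed(range(T)):
        t = torch.full((s.shape[0],), step)
        a = a - policy(s, a, t) / T          # crude reverse step
        a = a.detach().requires_grad_(True)
        (grad,) = torch.autograd.grad(guidance(s, a).sum(), a)
        a = (a + guidance_scale * grad / T).detach()  # steer toward high-value actions
    return a

acts = sample(policy, guidance, states[:8])
```

Freezing via `requires_grad_(False)` only stops gradients from reaching the guidance *parameters*; `torch.autograd.grad` can still differentiate the guidance output with respect to the action, which is what the sampling loop needs.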
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Nadav_Cohen1
Submission Number: 6331