Dynamics‑Aligned Diffusion Planning for Offline RL: A Unified Framework with Forward and Inverse Guidance

Published: 02 Mar 2026 · Last Modified: 02 Mar 2026 · Accepted by TMLR · License: CC BY 4.0
Abstract: Diffusion-based planning has emerged as a powerful paradigm for offline reinforcement learning (RL). However, existing approaches often overlook the physical constraints imposed by real-world dynamics, resulting in dynamics inconsistency: a mismatch between diffusion-generated trajectories and those feasible under the true environment transitions. To address this issue, we propose Dynamics-Aligned Diffusion Planning (DADP), a unified framework that explicitly enforces dynamics consistency during the diffusion denoising process. DADP offers two complementary variants: DADP-F (Forward), which employs a forward dynamics model to ensure state-level feasibility, and DADP-I (Inverse), which leverages an inverse dynamics model to improve action-level executability. Both variants share a unified guidance formulation that integrates task-return optimization and dynamics alignment through gradient-based updates. Experiments on the state-based D4RL Maze2D and MuJoCo benchmarks show that DADP-F and DADP-I outperform state-of-the-art offline RL baselines, effectively reducing dynamics inconsistency and improving long-horizon robustness. Our framework thus unifies diffusion-based planning with physically grounded dynamics modeling.
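The unified guidance described in the abstract can be pictured as a single gradient step inside each reverse-diffusion update. Below is a minimal, hypothetical PyTorch sketch of such a step in the DADP-F (forward) style; `denoiser`, `return_fn`, `dynamics_fn`, the weights `alpha`/`beta`, and `STATE_DIM` are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of a dynamics-aligned guided denoising step (DADP-F style).
# All names and hyperparameters are illustrative assumptions: `denoiser` is a
# trained denoising model, `return_fn` a differentiable return estimate, and
# `dynamics_fn` a forward model f(s_h, a_h) -> s_{h+1} fit on offline data.
import torch

STATE_DIM = 4  # e.g., Maze2D observation dimension (illustrative)

def guided_denoise_step(denoiser, tau_t, t, return_fn, dynamics_fn,
                        alpha=0.1, beta=0.1):
    """One reverse-diffusion step with return and dynamics-consistency guidance.

    tau_t: noisy trajectory of shape (horizon, STATE_DIM + action_dim)
    alpha, beta: weights trading off return maximization vs. feasibility
    """
    tau_t = tau_t.detach().requires_grad_(True)

    # Task-return guidance: push the sample toward high-return trajectories.
    ret = return_fn(tau_t)  # scalar return estimate

    # Dynamics-alignment energy: penalize mismatch between each generated
    # next state and the forward model's prediction (state-level feasibility).
    states, actions = tau_t[..., :STATE_DIM], tau_t[..., STATE_DIM:]
    pred_next = dynamics_fn(states[:-1], actions[:-1])
    dyn_energy = ((pred_next - states[1:]) ** 2).mean()

    # Single gradient of the combined objective, as in classifier guidance.
    grad = torch.autograd.grad(alpha * ret - beta * dyn_energy, tau_t)[0]

    # Standard denoiser mean, shifted by the guidance direction.
    with torch.no_grad():
        mean = denoiser(tau_t, t)
    return mean + grad
```

A DADP-I (inverse) variant would instead penalize the gap between the trajectory's actions and those recovered by an inverse model g(s_h, s_{h+1}), targeting action-level executability rather than state-level feasibility.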
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:

# Summary of Changes

We have revised the manuscript to address the requirements specified by the Action Editor and the reviewers. The key changes are as follows:

* **Experimental Comparison on Maze2D:** As requested, we have incorporated a comparative study with the **DPCC** baseline \citep{romer2024diffusion} on the **Maze2D** benchmark. This environment serves as our primary testbed for evaluating long-horizon "stitching" capabilities and dynamics consistency. The results have been added to **Table 1** and **Section 4.1**, demonstrating that DADP significantly outperforms DPCC in these complex navigation tasks.
* **Terminology Update:** Following the Action Editor's suggestion, we have replaced the phrase "theoretical insight" with "**principled formulation**" throughout the manuscript to more accurately describe our methodological contribution.
* **Efficiency and Accuracy Metrics:** We have added **Appendix E** and **Appendix F** to report the wall-clock latency analysis and the dynamics model test mean squared error (MSE). We show that the DADP overhead is marginal and that the learned dynamics models achieve high predictive accuracy, ensuring the reliability of the guidance gradients.
* **Mechanistic Analysis:** We have added a technical discussion in **Section 4.1** comparing our **soft guidance** approach with DPCC's **hard projection**. We explain that DADP's gradient-based energy minimization provides the necessary flexibility to balance reward maximization and physical feasibility, which is more effective for learning from suboptimal offline datasets (a schematic contrast is sketched below).
* **De-anonymization and Formatting:** The manuscript has been de-anonymized to include author names, affiliations, and funding acknowledgments (NSFC 62476128). The link to our official code repository is now provided. All revision-specific text coloring (e.g., blue or green text) has been removed for this final camera-ready version.
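To make the soft-vs-hard distinction concrete, here is a hypothetical side-by-side in the same PyTorch style as the sketch above; `dynamics_fn`, `STATE_DIM`, and `beta` are again illustrative assumptions rather than either paper's actual code.

```python
# Schematic contrast between DPCC-style hard projection and DADP-style soft
# guidance. Hypothetical names; STATE_DIM as in the earlier sketch.
import torch

STATE_DIM = 4  # illustrative

def hard_projection(tau, dynamics_fn):
    # Overwrite every generated next state with the forward model's
    # prediction: feasibility is enforced exactly, but the correction
    # ignores the return objective entirely.
    states, actions = tau[..., :STATE_DIM], tau[..., STATE_DIM:]
    projected = states.clone()
    projected[1:] = dynamics_fn(states[:-1], actions[:-1])
    return torch.cat([projected, actions], dim=-1)

def soft_guidance(tau, dynamics_fn, beta=0.1):
    # Take a gradient step down the dynamics-energy surface instead: the
    # trajectory is only nudged toward feasibility, leaving slack to keep
    # trading it off against reward maximization on suboptimal data.
    tau = tau.detach().requires_grad_(True)
    states, actions = tau[..., :STATE_DIM], tau[..., STATE_DIM:]
    energy = ((dynamics_fn(states[:-1], actions[:-1]) - states[1:]) ** 2).mean()
    return tau - beta * torch.autograd.grad(energy, tau)[0]
```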
Code: https://github.com/Wzhhhh0815/Dynamics-Aligned-Diffusion-Planning
Supplementary Material: zip
Assigned Action Editor: ~Shuai_Li3
Submission Number: 6468