Keywords: reinforcement learning, offline reinforcement learning, policy constraint, diffusion policy, diffusion model
Abstract: In this paper, we propose the two-fold improved diffusion policy (TDP) for offline reinforcement learning. We first propose the constrained diffusion policy optimization (CDPO) framework, which unifies existing diffusion-based policy constraint methods. TDP harnesses the full potential of CDPO by initializing with the closed-form solution of a constrained optimization problem and then applying a further constrained policy optimization step for refinement. We establish the theoretical properties of TDP, including expected policy improvement, the in-distribution property, and approximate gains over existing diffusion policies. We also propose a method for designing the estimate of the desired policy in the TDP loss function, which is needed to realize the aforementioned improvements. Empirical results on the D4RL benchmark show that TDP outperforms most existing offline reinforcement learning methods.
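The abstract does not spell out the constrained optimization problem whose closed-form solution initializes TDP. As a point of reference only, and not the paper's exact formulation, policy-constraint methods in offline RL typically solve a KL-regularized objective of the following standard form, where the behavior policy \(\mu\), value function \(Q\), and temperature \(\beta\) are assumed notation rather than symbols taken from the paper:

\[
\pi^{*} \;=\; \arg\max_{\pi}\;\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\big[Q(s,a)\big] \;-\; \frac{1}{\beta}\, D_{\mathrm{KL}}\!\big(\pi(\cdot \mid s)\,\big\|\,\mu(\cdot \mid s)\big)
\quad\Longrightarrow\quad
\pi^{*}(a \mid s) \;\propto\; \mu(a \mid s)\,\exp\!\big(\beta\, Q(s,a)\big).
\]

A solution of this weighted-behavior-policy form stays within the support of the dataset policy, which is consistent with the in-distribution property the abstract claims for TDP.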
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 23299