Keywords: reinforcement learning, offline reinforcement learning, policy constraint, diffusion policy, diffusion model
Abstract: In this paper, we propose the two-fold improved diffusion policy (TDP) for offline reinforcement learning. We first propose the constrained diffusion policy optimization (CDPO) framework, which unifies existing diffusion-based policy constraint methods. TDP harnesses the full potential of CDPO by initializing with the closed-form solution of a constrained optimization problem and then applying a further constrained policy optimization step for refinement. We establish the theoretical properties of TDP, including expected policy improvement, the in-distribution property, and approximate gains over existing diffusion policies. We also propose a method for designing the estimate of the desired policy in the TDP loss function, which is needed to realize the aforementioned improvements. Empirical results on the D4RL benchmark show that TDP outperforms most existing offline reinforcement learning methods.
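The abstract does not spell out the constrained optimization problem whose closed-form solution initializes TDP. As a point of reference only, and not the paper's exact formulation, policy-constraint methods in offline RL typically solve a KL-regularized objective of the following standard form, where the behavior policy \(\mu\), value function \(Q\), and temperature \(\beta\) are assumed notation rather than symbols taken from the paper:

\[
\pi^{*} \;=\; \arg\max_{\pi}\;\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\big[Q(s,a)\big] \;-\; \frac{1}{\beta}\, D_{\mathrm{KL}}\!\big(\pi(\cdot \mid s)\,\big\|\,\mu(\cdot \mid s)\big)
\quad\Longrightarrow\quad
\pi^{*}(a \mid s) \;\propto\; \mu(a \mid s)\,\exp\!\big(\beta\, Q(s,a)\big).
\]

A solution of this weighted-behavior-policy form stays within the support of the dataset policy, which is consistent with the in-distribution property the abstract claims for TDP.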
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 23299