Diffusion-Guided Safe Policy Optimization From Cost-Label-Free Offline Dataset

Feng Chen; Zhilong Zhang; Jiacheng Xu; Lei Yuan; Cong Guan; Zongzhang Zhang; Yang Yu

28 Sept 2024 (modified: 22 Nov 2024); ICLR 2025 Conference Withdrawn Submission; CC BY 4.0

Keywords: Reinforcement Learning, Offline Safe Reinforcement Learning, Diffusion Model

TL;DR: We propose a problem setup for learning safe policies from offline data without cost labels, and present a two-stage policy optimization solution.

Abstract: Offline safe reinforcement learning (RL) aims to guarantee safe decision-making during both training and deployment by learning a safe policy entirely from offline data, without further interaction with the environment, which moves RL closer to real-world applications. Previous work in offline safe RL typically presumes that the dataset contains Markovian costs. However, designing a Markovian cost function requires anticipating all potentially unsafe cases, which is inefficient and even infeasible in many practical tasks. In this work, we go a step further and learn a safe policy from an offline dataset that has no cost labels but includes a small number of safe demonstrations. To solve this problem, we propose a two-stage optimization method called **D**iffusion-guided **S**afe **P**olicy **O**ptimization (**DSPO**). First, we derive trajectory-wise safety signals by training a return-agnostic discriminator. Second, we train a conditional diffusion model that generates trajectories conditioned on both the trajectory return and the safety signal. The trajectories generated by our diffusion model not only yield high returns but also comply with the safety signals, and we derive the final policy from them through behavior cloning (BC). Experiments across tasks from the SafetyGym, BulletGym, and MetaDrive environments demonstrate that our approach achieves a safe policy with high returns, significantly outperforming various established baselines.
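To make the first stage more concrete, the following is a minimal sketch of how trajectory-wise safety signals might be derived and paired with returns as conditioning inputs. It is not the authors' implementation. Everything in it is an assumption made for illustration: the mean-state trajectory feature, the logistic-regression discriminator, the function names, and the reading of "return-agnostic" as simply withholding rewards from the discriminator. The conditional diffusion model and the behavior cloning step are omitted.

```python
import numpy as np

def trajectory_features(states):
    """Summarize one trajectory of shape (T, state_dim) by its mean state.
    Hypothetical feature choice; rewards are deliberately not used."""
    return np.asarray(states, dtype=float).mean(axis=0)

def _with_bias(feats):
    """Append a constant column so the linear model has a bias term."""
    feats = np.asarray(feats, dtype=float)
    return np.hstack([feats, np.ones((len(feats), 1))])

def train_discriminator(safe_feats, unlabeled_feats, lr=0.5, steps=500):
    """Fit logistic regression by gradient descent.
    Safe demonstrations get label 1, unlabeled trajectories get label 0."""
    X = _with_bias(np.vstack([safe_feats, unlabeled_feats]))
    y = np.concatenate([np.ones(len(safe_feats)), np.zeros(len(unlabeled_feats))])
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def safety_signal(w, feats):
    """Discriminator output in [0, 1]; higher means more similar to safe demos."""
    return 1.0 / (1.0 + np.exp(-_with_bias(feats) @ w))

def build_conditions(returns, signals):
    """Stack (min-max normalized return, safety signal) per trajectory.
    In this sketch, these pairs stand in for the diffusion model's conditioning."""
    r = np.asarray(returns, dtype=float)
    span = r.max() - r.min()
    r_norm = (r - r.min()) / span if span > 0 else np.zeros_like(r)
    return np.stack([r_norm, np.asarray(signals, dtype=float)], axis=1)
```

At generation time, a conditional model trained on such pairs would be queried with a high target return and a high safety signal. How the conditioning is actually encoded in DSPO is not specified in the abstract.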

Supplementary Material: zip

Primary Area: reinforcement learning

Submission Number: 13512
