Improving Safe Offline Reinforcement Learning via Dual-Guide Diffuser

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: safe reinforcement learning; diffusion model
TL;DR: The paper proposes a diffuser-based method that uses a dual-guide mechanism to sample safe, higher-reward trajectories, achieving state-of-the-art performance.
Abstract: In offline safe reinforcement learning (OSRL), upholding safety guarantees while optimizing task performance remains a significant challenge, particularly under adaptive and time-varying constraints. While recent approaches based on actor-critic methods or generative models have made progress, they often struggle to maintain robust safety adherence across diverse and dynamic conditions. For diffusion models specifically, a key bottleneck is the reliance on unreliable cost classifiers for safety guidance. To address these limitations, we propose SDGD (Safe Dual-Guide Diffuser), a novel framework that decouples safety and performance optimization. SDGD leverages classifier-free guidance (CFG) to strictly enforce cost constraints while simultaneously using classifier guidance (CG) to steer generation toward high-reward outcomes. This dual-guide mechanism robustly handles cost limits that change dynamically within a single episode. We further derive error bounds on the estimated reward and cost, providing performance and safety guarantees. Extensive evaluations on the DSRL benchmark show that SDGD establishes a new state of the art, achieving safety in 94.7% of tasks (36/38). Crucially, our method extends the aggregate Pareto frontier in the reward-cost space, achieving a superior trade-off within the safe region.
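
To make the dual-guide idea concrete, the following is a minimal sketch of a single reverse-diffusion denoising step that combines classifier-free guidance on a cost-limit condition with classifier guidance from a learned reward critic. This is not the paper's actual implementation: the interfaces denoiser(x, t, cond) and reward_critic(x, t), the guidance weights, and the noise-schedule scalars are all illustrative assumptions.

    import torch

    def dual_guided_step(denoiser, reward_critic, x_t, t, cost_limit,
                         alpha_t, alpha_bar_t, sigma_t, w_cfg=1.5, w_cg=0.1):
        # Hypothetical interfaces (not from the paper):
        #   denoiser(x, t, cond) -> predicted noise; cond=None gives the
        #                           unconditional pass used for CFG
        #   reward_critic(x, t)  -> scalar estimate of trajectory return
        # alpha_t, alpha_bar_t, sigma_t are scalar tensors from the noise schedule.

        # Classifier-free guidance: blend conditional and unconditional noise
        # predictions so sampling respects the (possibly time-varying) cost limit.
        eps_cond = denoiser(x_t, t, cond=cost_limit)
        eps_uncond = denoiser(x_t, t, cond=None)
        eps = eps_uncond + w_cfg * (eps_cond - eps_uncond)

        # Classifier guidance: gradient of the reward critic w.r.t. the noisy
        # trajectory, used to steer generation toward high-reward outcomes.
        with torch.enable_grad():
            x_req = x_t.detach().requires_grad_(True)
            grad = torch.autograd.grad(reward_critic(x_req, t).sum(), x_req)[0]

        # Standard DDPM posterior mean from the guided noise estimate, shifted
        # along the reward gradient, plus fresh noise for the next step.
        mean = (x_t - (1 - alpha_t) / torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)
        mean = mean + w_cg * sigma_t**2 * grad
        return mean + sigma_t * torch.randn_like(x_t)
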
Primary Area: reinforcement learning
Submission Number: 12181