Improving Safe Offline Reinforcement Learning via Dual-Guide Diffuser

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: safe reinforcement learning; diffusion model
TL;DR: The paper proposes a diffuser-based method that uses a dual-guide mechanism to sample safe, higher-reward trajectories, achieving state-of-the-art performance.
Abstract: In offline safe reinforcement learning (OSRL), upholding safety guarantees while optimizing task performance remains a significant challenge, particularly under adaptive and time-varying constraints. While recent approaches based on actor-critic methods or generative models have made progress, they often struggle to maintain robust safety adherence across diverse and dynamic conditions. For diffusion models specifically, a key bottleneck is the reliance on unreliable cost classifiers for safety guidance. To address these limitations, we propose SDGD (Safe Dual-Guide Diffuser), a novel framework that decouples safety and performance optimization. SDGD leverages classifier-free guidance (CFG) to strictly enforce cost constraints while simultaneously using classifier guidance (CG) to steer generation toward high-reward outcomes. This dual-guide mechanism robustly handles cost limits that change dynamically within a single episode. We further derive error bounds on the estimated reward and cost, providing performance and safety guarantees. Extensive evaluations on the DSRL benchmark show that SDGD establishes a new state of the art, achieving safety in 94.7% of tasks (36/38). Crucially, our method extends the aggregate Pareto frontier in the reward-cost space, achieving a superior trade-off within the safe region.
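
To make the dual-guide idea concrete, the following is a minimal sketch of a single reverse-diffusion denoising step that combines classifier-free guidance on a cost-limit condition with classifier guidance from a learned reward critic. This is not the paper's actual implementation: the interfaces denoiser(x, t, cond) and reward_critic(x, t), the guidance weights, and the noise-schedule scalars are all illustrative assumptions.

    import torch

    def dual_guided_step(denoiser, reward_critic, x_t, t, cost_limit,
                         alpha_t, alpha_bar_t, sigma_t, w_cfg=1.5, w_cg=0.1):
        # Hypothetical interfaces (not from the paper):
        #   denoiser(x, t, cond) -> predicted noise; cond=None gives the
        #                           unconditional pass used for CFG
        #   reward_critic(x, t)  -> scalar estimate of trajectory return
        # alpha_t, alpha_bar_t, sigma_t are scalar tensors from the noise schedule.

        # Classifier-free guidance: blend conditional and unconditional noise
        # predictions so sampling respects the (possibly time-varying) cost limit.
        eps_cond = denoiser(x_t, t, cond=cost_limit)
        eps_uncond = denoiser(x_t, t, cond=None)
        eps = eps_uncond + w_cfg * (eps_cond - eps_uncond)

        # Classifier guidance: gradient of the reward critic w.r.t. the noisy
        # trajectory, used to steer generation toward high-reward outcomes.
        with torch.enable_grad():
            x_req = x_t.detach().requires_grad_(True)
            grad = torch.autograd.grad(reward_critic(x_req, t).sum(), x_req)[0]

        # Standard DDPM posterior mean from the guided noise estimate, shifted
        # along the reward gradient, plus fresh noise for the next step.
        mean = (x_t - (1 - alpha_t) / torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)
        mean = mean + w_cg * sigma_t**2 * grad
        return mean + sigma_t * torch.randn_like(x_t)
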
Primary Area: reinforcement learning
Submission Number: 12181