Keywords: safe reinforcement learning; diffusion model
TL;DR: The paper proposes a diffuser-based method that leverages dual guidance to sample safe, high-reward trajectories, achieving state-of-the-art performance.
Abstract: In offline safe reinforcement learning (OSRL), upholding safety guarantees while optimizing task performance remains a significant challenge, particularly when dealing with adaptive and time-varying constraints.
While recent approaches based on actor-critic methods or generative models have made progress, they often struggle with robust safety adherence across diverse and dynamic conditions.
For diffusion models specifically, a key bottleneck is the reliance on unreliable cost classifiers for safety guidance.
To address these limitations, we propose SDGD (Safe Dual-Guide Diffuser), a novel framework that decouples safety and performance optimization.
SDGD leverages classifier-free guidance (CFG) to strictly enforce cost constraints while simultaneously using classifier guidance (CG) to steer generation towards high-reward outcomes.
This dual-guide mechanism robustly handles cost limits that change dynamically within a single episode.
We derive error bounds on the reward and cost estimates, providing performance and safety guarantees.
Extensive evaluations on the DSRL benchmark demonstrate that SDGD establishes a new state-of-the-art, achieving safety in 94.7% of tasks (36/38).
Crucially, our method extends the aggregate Pareto frontier in the reward-cost space, achieving a superior trade-off in the safe region.
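For illustration, a minimal sketch of a dual-guide denoising step is given below, assuming a toy noise predictor conditioned on a scalar cost limit (for classifier-free guidance) and a differentiable reward model (for classifier guidance); the module names, update rule, and guidance weights `w_cfg` and `w_cg` are hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical sketch: one reverse-diffusion step that combines classifier-free
# guidance (CFG) on a cost-limit condition with classifier guidance (CG) from a
# reward model. All names, shapes, and the update rule are illustrative only.
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Toy epsilon-predictor conditioned on a scalar cost limit (stand-in for the diffuser)."""
    def __init__(self, traj_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim + 2, hidden), nn.SiLU(), nn.Linear(hidden, traj_dim)
        )

    def forward(self, x, t, cost_limit):
        # A negative cost_limit is treated as the "unconditional" null token.
        cond = torch.cat([x, t, cost_limit], dim=-1)
        return self.net(cond)

class RewardModel(nn.Module):
    """Toy differentiable reward estimator used for classifier guidance."""
    def __init__(self, traj_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(traj_dim, 64), nn.SiLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x)

def dual_guide_step(eps_model, reward_model, x_t, t, cost_limit,
                    w_cfg: float = 2.0, w_cg: float = 0.5, sigma: float = 0.1):
    """One illustrative reverse-diffusion step with dual guidance."""
    null = torch.full_like(cost_limit, -1.0)        # unconditional (null) cost token
    eps_cond = eps_model(x_t, t, cost_limit)        # cost-conditioned prediction
    eps_uncond = eps_model(x_t, t, null)            # unconditional prediction
    # CFG: push the sample toward trajectories consistent with the cost limit.
    eps = eps_uncond + w_cfg * (eps_cond - eps_uncond)
    # CG: ascend the reward model's gradient with respect to the noisy trajectory.
    with torch.enable_grad():
        x_req = x_t.detach().requires_grad_(True)
        grad_r = torch.autograd.grad(reward_model(x_req).sum(), x_req)[0]
    # Simplified update rule standing in for the full reverse-diffusion posterior.
    return x_t - sigma * eps + w_cg * sigma * grad_r

if __name__ == "__main__":
    traj_dim = 8
    eps_model, reward_model = NoisePredictor(traj_dim), RewardModel(traj_dim)
    x = torch.randn(4, traj_dim)            # batch of noisy trajectories
    t = torch.full((4, 1), 0.5)             # normalized diffusion timestep
    limit = torch.full((4, 1), 10.0)        # per-episode cost limit (can vary over time)
    print(dual_guide_step(eps_model, reward_model, x, t, limit).shape)
```

Because the cost limit enters only as a conditioning input at sampling time, the same trained model can, in this sketch, be re-conditioned on a new limit at any step of an episode.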
Primary Area: reinforcement learning
Submission Number: 12181