Uncertainty-Aware Safety Propagation Critics for Safe Reinforcement Learning

TMLR Paper7469 Authors

11 Feb 2026 (modified: 02 Mar 2026) · Under review for TMLR · CC BY 4.0
Abstract: Safe reinforcement learning (RL) aims to optimize long-term performance while satisfying safety constraints, a requirement that is critical in many applications but difficult to guarantee when cost estimates are inaccurate or data is limited. In model-free actor-critic methods, cost critics are often unreliable in poorly explored regions, leading to constraint violations during both training and deployment. In this work, we propose a novel uncertainty-aware approach to safe RL, Uncertainty-aware Safety Propagation Critics (USPC), which constructs conservative cost surrogates from epistemic uncertainty. Our method trains an ensemble of cost critics to estimate uncertainty and uses these estimates to build an upper confidence bound on predicted costs. We then introduce a safe set network that approximates a pessimistic surrogate of the cost action-value function, inspired by safe Bayesian optimization, enabling scalable safety propagation in continuous state-action spaces. Replacing standard cost critics with this surrogate in existing off-policy safe RL algorithms yields policies that are significantly less likely to violate cost constraints. Empirically, across multiple Safety Gymnasium benchmark tasks, our approach consistently reduces both the frequency and magnitude of constraint violations while maintaining competitive reward performance relative to strong baselines.
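The upper-confidence-bound construction described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the critic ensemble is stood in for by an array of per-member cost predictions, and the function name `ucb_cost` and the pessimism coefficient `beta` are hypothetical names chosen for the example (the paper's safe set network is not shown).

```python
import numpy as np

def ucb_cost(ensemble_preds, beta=1.0):
    """Conservative cost surrogate from an ensemble of cost critics.

    ensemble_preds: array of shape (n_members, n_points), one row of
    predicted costs per ensemble member. Epistemic uncertainty is
    approximated by the ensemble's standard deviation, and the surrogate
    is the mean shifted upward by beta times that spread.
    """
    preds = np.asarray(ensemble_preds, dtype=float)
    mean = preds.mean(axis=0)   # ensemble mean cost estimate
    std = preds.std(axis=0)     # disagreement = epistemic uncertainty proxy
    return mean + beta * std    # pessimistic (upper-bound) cost

# Toy example: 4 critic heads scoring 3 state-action pairs.
preds = np.array([
    [0.1, 0.5, 0.9],
    [0.2, 0.4, 1.1],
    [0.1, 0.6, 0.8],
    [0.2, 0.5, 1.2],
])
conservative = ucb_cost(preds, beta=2.0)
```

A policy constrained against `conservative` rather than the ensemble mean is penalized most where the critics disagree, i.e. in poorly explored regions, which is the mechanism the abstract credits for fewer constraint violations.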
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Oleg_Arenz1
Submission Number: 7469