Uncertainty-Aware Safety Propagation Critics for Safe Reinforcement Learning

TMLR Paper7469 Authors

11 Feb 2026 (modified: 02 Mar 2026) · Under review for TMLR · CC BY 4.0
Abstract: Safe reinforcement learning (RL) aims to optimize long-term performance while satisfying safety constraints, a requirement that is critical in many applications but difficult to guarantee when cost estimates are inaccurate or data is limited. In model-free actor-critic methods, cost critics are often unreliable in poorly explored regions, leading to constraint violations during both training and deployment. In this work, we propose a novel uncertainty-aware approach to safe RL, Uncertainty-aware Safety Propagation Critics (USPC), which constructs conservative cost surrogates from epistemic uncertainty. Our method trains an ensemble of cost critics to estimate uncertainty and uses these estimates to build an upper confidence bound on predicted costs. We then introduce a safe set network that approximates a pessimistic surrogate of the cost action-value function, inspired by safe Bayesian optimization, enabling scalable safety propagation in continuous state-action spaces. Replacing standard cost critics with this surrogate in existing off-policy safe RL algorithms yields policies that are significantly less likely to violate cost constraints. Empirically, across multiple Safety Gymnasium benchmark tasks, our approach consistently reduces both the frequency and magnitude of constraint violations while maintaining competitive reward performance relative to strong baselines.
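The upper-confidence-bound construction described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the critic ensemble is stood in for by an array of per-member cost predictions, and the function name `ucb_cost` and the pessimism coefficient `beta` are hypothetical names chosen for the example (the paper's safe set network is not shown).

```python
import numpy as np

def ucb_cost(ensemble_preds, beta=1.0):
    """Conservative cost surrogate from an ensemble of cost critics.

    ensemble_preds: array of shape (n_members, n_points), one row of
    predicted costs per ensemble member. Epistemic uncertainty is
    approximated by the ensemble's standard deviation, and the surrogate
    is the mean shifted upward by beta times that spread.
    """
    preds = np.asarray(ensemble_preds, dtype=float)
    mean = preds.mean(axis=0)   # ensemble mean cost estimate
    std = preds.std(axis=0)     # disagreement = epistemic uncertainty proxy
    return mean + beta * std    # pessimistic (upper-bound) cost

# Toy example: 4 critic heads scoring 3 state-action pairs.
preds = np.array([
    [0.1, 0.5, 0.9],
    [0.2, 0.4, 1.1],
    [0.1, 0.6, 0.8],
    [0.2, 0.5, 1.2],
])
conservative = ucb_cost(preds, beta=2.0)
```

A policy constrained against `conservative` rather than the ensemble mean is penalized most where the critics disagree, i.e. in poorly explored regions, which is the mechanism the abstract credits for fewer constraint violations.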
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Oleg_Arenz1
Submission Number: 7469