Track: Research Track
Keywords: Safety, RL, TRPO, trust region, CMDP
Abstract: Reinforcement learning (RL) holds promise for sequential decision-making, yet real-world adoption in safety-critical settings remains limited by unsafe exploration during training. Constrained Markov Decision Processes (CMDPs) offer a formalism for safe RL, and several methods provide constraint-satisfaction guarantees, but they often fall short on empirical safety, incurring violations during training or deployment. We introduce sTRPO, which augments the standard trust-region update with an explicit exclusion of unsafe regions of the policy space: an auxiliary unsafe policy is learned to estimate high-risk regions, and each trust-region update explicitly reduces the new policy's distributional overlap with it. Our key algorithmic novelty is a GAE-driven joint advantage: generalized advantage estimation (GAE) supplies reliable n-step local signals for both return and safety, while the region-exclusion step provides global planning, stitching those local decisions into trajectories that avoid unsafe occupancy. This dual-stage optimization yields monotonic improvement in both the reward and safety objectives; theoretically, we derive per-iteration bounds on worst-case safety degradation and on constraint satisfaction. Practically, the auxiliary unsafe model can be trained with any RL algorithm in simulation, making sTRPO robust and amenable to sim-to-real transfer. Empirically, on Safety-Gymnasium, sTRPO outperforms seven state-of-the-art baselines, achieving significantly fewer constraint violations while maintaining competitive task performance. Together, these results position sTRPO as a scalable, theoretically grounded framework for deploying RL in safety-critical environments.
Submission Number: 93
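
A minimal sketch, not the authors' implementation, of the two ingredients the abstract names: GAE advantages computed separately for reward and for cost and combined into a joint signal, and a surrogate objective with a sample-based term that pushes the updated policy away from a learned unsafe policy. All names and coefficients (lambda_cost, overlap_coef, logp_unsafe) are illustrative assumptions; the paper's exact region-exclusion mechanism may differ.

```python
import numpy as np

def gae(deltas, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory of TD residuals
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    adv = np.zeros_like(deltas, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

def joint_advantage(reward_deltas, cost_deltas, lambda_cost=1.0):
    """Joint local signal: reward advantage minus a weighted cost (safety) advantage.
    lambda_cost is an assumed trade-off weight, not a value from the paper."""
    return gae(reward_deltas) - lambda_cost * gae(cost_deltas)

def surrogate_loss(logp_new, logp_old, logp_unsafe, adv, overlap_coef=0.1):
    """TRPO-style importance-weighted surrogate plus a crude on-sample estimate of
    KL(pi_new || pi_unsafe) that is rewarded, so minimizing the loss moves the
    policy away from the unsafe policy (i.e., reduces distributional overlap)."""
    ratio = np.exp(logp_new - logp_old)
    kl_from_unsafe = np.mean(ratio * (logp_new - logp_unsafe))
    return -np.mean(ratio * adv) - overlap_coef * kl_from_unsafe

# Toy usage with random trajectory data, just to show the shapes involved.
rng = np.random.default_rng(0)
T = 8
adv = joint_advantage(rng.normal(size=T), rng.normal(size=T), lambda_cost=0.5)
loss = surrogate_loss(rng.normal(size=T), rng.normal(size=T),
                      rng.normal(size=T), adv)
print(loss)
```

In an actual trust-region method this loss would be optimized subject to a KL constraint against the old policy; the sketch only illustrates how a joint reward/cost advantage and an overlap penalty could enter the objective.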