Safety-Biased Policy Optimisation: Towards Hard-Constrained Reinforcement Learning via Trust Regions
Keywords: Reinforcement Learning, Safe Reinforcement Learning, Model-free Reinforcement Learning, Model-free Safe Reinforcement Learning
TL;DR: We propose SB-TRPO, which adaptively biases natural policy gradients toward constraint satisfaction while still seeking reward improvement. SB-TRPO achieves competitive rewards with low costs on Safety Gymnasium tasks.
Abstract: Reinforcement learning (RL) in safety-critical domains requires agents to maximise rewards while strictly adhering to safety constraints. Existing approaches, such as Lagrangian and projection-based methods, often either fail to ensure near-zero safety violations or sacrifice reward performance in the face of hard constraints. We propose Safety-Biased Trust Region Policy Optimisation (SB-TRPO), a new trust-region algorithm for hard-constrained RL. SB-TRPO adaptively biases policy updates toward constraint satisfaction while still seeking reward improvement. Concretely, it performs trust-region updates using a convex combination of the natural policy gradients of cost and reward, ensuring a fixed fraction of optimal cost reduction at each step. We provide a theoretical guarantee of local progress toward safety, with reward improvement when gradients are suitably aligned. Experiments on standard and challenging Safety Gymnasium tasks show that SB-TRPO consistently achieves the best balance of safety and meaningful task completion compared to state-of-the-art methods.
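For concreteness, the following is a minimal sketch (not taken from the submission) of the kind of update the abstract describes: a convex combination of the natural policy gradients of cost and reward, scaled to a KL trust region. The names `fvp` (a Fisher-vector-product callable), `grad_reward`, `grad_cost`, the mixing coefficient `beta`, and the KL radius `delta` are assumptions for illustration; the paper's adaptive choice of the mixing coefficient (to secure a fixed fraction of the best attainable cost reduction) is left here as an externally supplied input.

```python
import numpy as np

def conjugate_gradient(fvp, b, iters=10, tol=1e-10):
    """Approximately solve F x = b, where fvp(v) returns F @ v."""
    x = np.zeros_like(b)
    r = b.copy()
    p = b.copy()
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp + 1e-12)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def safety_biased_step(grad_reward, grad_cost, fvp, beta, delta):
    """Sketch of a safety-biased trust-region update direction.

    beta in [0, 1] weights the cost-reduction natural gradient against the
    reward-improvement natural gradient; delta is the KL trust-region radius.
    """
    # Natural gradients: F^{-1} g for the reward objective and the negated cost.
    nat_reward = conjugate_gradient(fvp, grad_reward)
    nat_cost = conjugate_gradient(fvp, -grad_cost)  # descend the expected cost
    # Convex combination biased toward constraint satisfaction.
    d = beta * nat_cost + (1.0 - beta) * nat_reward
    # Scale so the quadratic KL approximation 0.5 * d^T F d equals delta.
    quad = d @ fvp(d)
    return np.sqrt(2.0 * delta / (quad + 1e-12)) * d
```

In this sketch, setting `beta = 1` recovers a pure cost-descent natural-gradient step and `beta = 0` a standard TRPO-style reward step; the algorithm described in the abstract would pick `beta` adaptively at each update rather than fixing it by hand.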
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 10589