Safety-Biased Policy Optimisation: Towards Hard-Constrained Reinforcement Learning via Trust Regions
Keywords: Reinforcement Learning, Safe Reinforcement Learning, Model-free Reinforcement Learning, Model-free Safe Reinforcement Learning
TL;DR: We propose SB-TRPO, which adaptively biases natural policy gradients toward constraint satisfaction while still seeking reward improvement. SB-TRPO achieves competitive rewards with low costs on Safety Gymnasium tasks.
Abstract: Reinforcement learning (RL) in safety-critical domains requires agents to maximise rewards while strictly adhering to safety constraints. Existing approaches, such as Lagrangian and projection-based methods, often either fail to ensure near-zero safety violations or sacrifice reward performance in the face of hard constraints. We propose Safety-Biased Trust Region Policy Optimisation (SB-TRPO), a new trust-region algorithm for hard-constrained RL. SB-TRPO adaptively biases policy updates toward constraint satisfaction while still seeking reward improvement. Concretely, it performs trust-region updates using a convex combination of the natural policy gradients of cost and reward, ensuring a fixed fraction of optimal cost reduction at each step. We provide a theoretical guarantee of local progress toward safety, with reward improvement when gradients are suitably aligned. Experiments on standard and challenging Safety Gymnasium tasks show that SB-TRPO consistently achieves the best balance of safety and meaningful task completion compared to state-of-the-art methods.
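For concreteness, the following is a minimal sketch (not taken from the submission) of the kind of update the abstract describes: a convex combination of the natural policy gradients of cost and reward, scaled to a KL trust region. The names `fvp` (a Fisher-vector-product callable), `grad_reward`, `grad_cost`, the mixing coefficient `beta`, and the KL radius `delta` are assumptions for illustration; the paper's adaptive choice of the mixing coefficient (to secure a fixed fraction of the best attainable cost reduction) is left here as an externally supplied input.

```python
import numpy as np

def conjugate_gradient(fvp, b, iters=10, tol=1e-10):
    """Approximately solve F x = b, where fvp(v) returns F @ v."""
    x = np.zeros_like(b)
    r = b.copy()
    p = b.copy()
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp + 1e-12)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def safety_biased_step(grad_reward, grad_cost, fvp, beta, delta):
    """Sketch of a safety-biased trust-region update direction.

    beta in [0, 1] weights the cost-reduction natural gradient against the
    reward-improvement natural gradient; delta is the KL trust-region radius.
    """
    # Natural gradients: F^{-1} g for the reward objective and the negated cost.
    nat_reward = conjugate_gradient(fvp, grad_reward)
    nat_cost = conjugate_gradient(fvp, -grad_cost)  # descend the expected cost
    # Convex combination biased toward constraint satisfaction.
    d = beta * nat_cost + (1.0 - beta) * nat_reward
    # Scale so the quadratic KL approximation 0.5 * d^T F d equals delta.
    quad = d @ fvp(d)
    return np.sqrt(2.0 * delta / (quad + 1e-12)) * d
```

In this sketch, setting `beta = 1` recovers a pure cost-descent natural-gradient step and `beta = 0` a standard TRPO-style reward step; the algorithm described in the abstract would pick `beta` adaptively at each update rather than fixing it by hand.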
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 10589