Constrained Reinforcement Learning with Smoothed Log Barrier Function

TMLR Paper 5434 Authors

21 Jul 2025 (modified: 29 Aug 2025) · Under review for TMLR · CC BY 4.0
Abstract: Deploying reinforcement learning (RL) in real-world systems often requires satisfying strict safety constraints during both training and deployment, which simple reward shaping typically fails to enforce. Existing constrained RL algorithms face two persistent challenges: training instability and overly conservative policies. To overcome these limitations, we propose CSAC-LB (Constrained Soft Actor-Critic with Log Barrier), a model-free, sample-efficient, off-policy algorithm that requires no pre-training. CSAC-LB integrates a linear smoothed log barrier function into the actor's objective, providing a numerically stable, non-vanishing gradient that enables the agent to quickly recover from unsafe states while avoiding the instability of traditional interior-point methods. To further enhance safety and mitigate the underestimation of constraint violations, we employ a pessimistic double-critic architecture for the cost function, taking the maximum of two cost Q-networks to conservatively guide the policy. Through extensive experiments on challenging constrained control tasks, we demonstrate that CSAC-LB significantly outperforms baselines, consistently achieving high returns while strictly adhering to safety constraints. Our results establish CSAC-LB as a robust and stable solution for applying RL to safety-critical domains.
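As a rough illustration of the mechanism the abstract describes, the sketch below implements one standard linear smoothed log-barrier formulation from the interior-point literature (log barrier below a threshold, continued as a line matched in value and slope above it) together with the pessimistic max over two cost critics. The exact form, hyperparameters, and loss used in the paper may differ; the names `q_reward`, `qc1`, `qc2`, `cost_limit`, and the sharpness factor `t` are illustrative assumptions, not taken from the submission.

```python
import math
import torch

def smoothed_log_barrier(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Linear smoothed log barrier for a constraint of the form z <= 0.

    For z <= -1/t^2 this is the classic log barrier -(1/t) * log(-z);
    past that point it continues as a line of slope t, matched in value
    and first derivative, so the gradient stays finite and non-vanishing
    even when the constraint is violated (z > 0).
    """
    threshold = -1.0 / t ** 2
    # clamp keeps the branch of torch.where that is not selected free of NaNs
    log_branch = -(1.0 / t) * torch.log(torch.clamp(-z, min=1e-8))
    linear_branch = t * z - (1.0 / t) * math.log(1.0 / t ** 2) + 1.0 / t
    return torch.where(z <= threshold, log_branch, linear_branch)

def actor_loss(q_reward: torch.Tensor,
               qc1: torch.Tensor,
               qc2: torch.Tensor,
               cost_limit: float = 25.0,
               t: float = 2.0) -> torch.Tensor:
    """Hypothetical actor objective in the spirit of the abstract:
    trade the reward critic off against a barrier penalty on a
    pessimistic cost estimate. Taking the max of the two cost
    Q-networks guards against underestimating constraint violations.
    """
    qc_pessimistic = torch.max(qc1, qc2)
    penalty = smoothed_log_barrier(qc_pessimistic - cost_limit, t=t)
    return (-q_reward + penalty).mean()
```

Because the linear branch takes over smoothly where the log barrier would diverge, the penalty is defined and differentiable even for infeasible cost estimates, which is what lets the policy receive a useful recovery gradient from unsafe states instead of the unbounded gradients of a pure interior-point barrier.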
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Matteo_Papini1
Submission Number: 5434