Safe Online Learning via Smooth Safety-Structured Policy Composition

TMLR Paper8997 Authors

17 May 2026 (modified: 26 May 2026). Under review for TMLR. License: CC BY 4.0

Abstract: Safe online reinforcement learning requires policies to respect safety constraints while maintaining smooth optimization dynamics. Existing approaches typically rely on either strict safety enforcement via action interventions, which introduce discontinuities in system interaction and learning, or soft safety constraint formulations, which preserve smooth learning but provide limited safety assurance. We propose AutoSafe, a safety-aware policy architecture that integrates structured safety monitoring and intervention directly into the action generation process. This design enables smooth, risk-dependent transitions between performance-driven and safety-preserving behaviors, resulting in continuous online interaction and learning dynamics. Empirical results across a suite of continuous-control benchmarks demonstrate strong safety enforcement without sacrificing learning smoothness. We further validate AutoSafe on a physical cart-pole system, highlighting its practical effectiveness for safe online learning in the real world.
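
To make the idea of a smooth, risk-dependent transition concrete, the following is a minimal sketch, not AutoSafe's actual architecture, which the abstract does not specify. It assumes a scalar risk estimate in [0, 1], a sigmoid gate, and a convex blend of two candidate actions. The names `blend_weight`, `compose_action`, `threshold`, and `sharpness` are hypothetical and chosen only for illustration.

```python
import math


def blend_weight(risk, threshold=0.5, sharpness=10.0):
    """Hypothetical smooth gate in (0, 1): near 0 at low risk, near 1 at high risk."""
    return 1.0 / (1.0 + math.exp(-sharpness * (risk - threshold)))


def compose_action(a_perf, a_safe, risk, threshold=0.5, sharpness=10.0):
    """Convex blend of a performance-driven action and a safety-preserving action.

    Assumption: both actions are equal-length sequences of floats, and the
    blend weight depends only on a scalar risk estimate.
    """
    w = blend_weight(risk, threshold, sharpness)
    return [(1.0 - w) * p + w * s for p, s in zip(a_perf, a_safe)]


if __name__ == "__main__":
    # Sweep the risk estimate: the composed action moves continuously from
    # the performance action toward the safe action, with no hard switch.
    for risk in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(risk, compose_action([1.0], [-1.0], risk))
```

Because the gate is differentiable in the risk estimate, the composed action varies continuously as risk rises, in contrast to a hard override that replaces the action once a threshold is crossed.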

Submission Type: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Jacek_Cyranka1

Submission Number: 8997
