Provably Efficient RL for Linear MDPs under Instantaneous Safety Constraints in Non-Convex Feature Spaces
Abstract: In Reinforcement Learning (RL), tasks with instantaneous hard constraints pose significant challenges, particularly when the decision space is non-convex or non-star-convex. This issue is especially relevant in domains such as autonomous vehicles and robotics, where constraints like collision avoidance often take a non-convex form and the state space may be large. In this paper, we establish a regret bound of \(\tilde{\mathcal{O}}\bigl(\bigl(1 + \tfrac{1}{\tau}\bigr) \sqrt{\log\bigl(\tfrac{1}{\tau}\bigr) d^3 H^4 K} \bigr)\) for a linear MDP setting, applicable to both star-convex and non-star-convex cases, where \(d\) is the feature dimension, \(H\) the episode length, \(K\) the number of episodes, and \(\tau\) the safety threshold. Moreover, with high probability, the number of safety violations is \textit{zero} throughout the learning process. A key technical challenge in these settings is bounding the covering number of the value-function class, which is essential for achieving value-aware uniform concentration in model-free function approximation. For the star-convex setting, we develop a novel technique called \textit{Objective–Constraint Decomposition} (OCD) that properly bounds the covering number and resolves an error in prior work on constrained RL. In non-star-convex scenarios, where the covering number can be unbounded, we propose a two-phase algorithm, Non-Convex Safe Least Squares Value Iteration (NCS-LSVI), which first reduces uncertainty about the safe set by playing a known safe policy and then carefully balances exploration and exploitation to achieve the regret bound. Finally, numerical simulations on an autonomous driving scenario demonstrate the effectiveness of NCS-LSVI.
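The two-phase structure of NCS-LSVI can be illustrated with a small, self-contained sketch. Everything below is an assumption made for illustration, not the paper's construction: the feature map, the toy dynamics, a single shared Gram matrix in place of per-step matrices, a myopic value estimate in place of the backward LSVI recursion, the bonus scale `beta`, and the convention that an action is safe when its instantaneous cost is at most \(\tau\). The sketch only shows the shape of the idea: Phase 1 plays a known safe policy to shrink uncertainty about the safe set, and Phase 2 restricts play to actions whose pessimistic safety estimate clears the threshold while acting optimistically on reward among them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a linear MDP; all quantities here are illustrative.
d, H, A = 4, 3, 5             # feature dimension, horizon, number of actions
K, K0 = 200, 40               # total episodes, Phase-1 episodes
tau = 0.5                     # safety threshold (safe: cost <= tau, assumed)
theta_r = rng.normal(size=d)          # hidden reward parameter
theta_c = np.abs(rng.normal(size=d))  # hidden safety-cost parameter

def features(s, a):
    """Hypothetical feature map phi(s, a) in R^d."""
    v = np.cos(s + a * np.arange(1, d + 1))
    return v / np.linalg.norm(v)

def step(s, a):
    """Toy dynamics: linear reward/cost in the features, drifting state."""
    phi = features(s, a)
    reward = float(phi @ theta_r)
    cost = float(phi @ theta_c)   # instantaneous safety signal
    return s + 0.1 * a - 0.2, reward, cost

safe_action = 0  # assume one action known to be safe everywhere

# Phase 1: play the known safe policy to shrink safe-set uncertainty.
Lam = np.eye(d)               # ridge-regularized Gram matrix
b_c = np.zeros(d)
for _ in range(K0):
    s = 0.0
    for _ in range(H):
        phi = features(s, safe_action)
        s, _, cost = step(s, safe_action)
        Lam += np.outer(phi, phi)
        b_c += cost * phi

# Phase 2: keep only actions whose pessimistic safety estimate clears tau,
# then act optimistically on reward among the surviving actions.
beta = 1.0                    # bonus scale; the paper tunes this from d, H, K
b_r = np.zeros(d)
total_reward = 0.0
for _ in range(K - K0):
    Lam_inv = np.linalg.inv(Lam)
    w_r = Lam_inv @ b_r       # ridge estimate of the reward parameter
    w_c = Lam_inv @ b_c       # ridge estimate of the safety parameter
    s = 0.0
    for _ in range(H):
        phis = np.stack([features(s, a) for a in range(A)])
        bonus = beta * np.sqrt(np.einsum('ai,ij,aj->a', phis, Lam_inv, phis))
        safe = (phis @ w_c + bonus) <= tau   # pessimistic safety check
        safe[safe_action] = True             # fallback keeps the set nonempty
        q = np.where(safe, phis @ w_r + bonus, -np.inf)  # optimism on reward
        a = int(np.argmax(q))
        phi = phis[a]
        s, reward, cost = step(s, a)
        total_reward += reward
        Lam += np.outer(phi, phi)
        b_r += reward * phi
        b_c += cost * phi

print(f"Phase-2 cumulative reward: {total_reward:.2f}")
```

In the full algorithm, the pessimistic safety check is what keeps violations at zero with high probability, while bonus-driven optimism over the surviving actions drives the regret bound; the sketch above compresses both mechanisms into a single myopic loop.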