Provably Efficient RL for Linear MDPs under Instantaneous Safety Constraints in Non-Convex Feature Spaces
TL;DR: This is the first paper to achieve sub-linear regret with zero constraint violation for linear MDPs with non-convex feature spaces.
Abstract: In Reinforcement Learning (RL), tasks with instantaneous hard constraints present significant challenges, particularly when the decision space is non-convex or non-star-convex. This issue is especially relevant in domains like autonomous vehicles and robotics, where constraints such as collision avoidance often take a non-convex form. In this paper, we establish a regret bound of $\tilde{\mathcal{O}}((1 + \tfrac{1}{\tau}) \sqrt{\log(\frac{1}{\tau}) d^3 H^4 K})$, applicable to both star-convex and non-star-convex cases, where $d$ is the feature dimension, $H$ the episode length, $K$ the number of episodes, and $\tau$ the safety threshold. Moreover, the violation of safety constraints is zero with high probability throughout the learning process. A key technical challenge in these settings is bounding the covering number of the value-function class, which is essential for achieving value-aware uniform concentration in model-free function approximation. For the star-convex setting, we develop a novel technique called *Objective–Constraint Decomposition* (OCD) to properly bound the covering number. This result also resolves an error in a previous work on constrained RL. In non-star-convex scenarios, where the covering number can become infinitely large, we propose a two-phase algorithm, Non-Convex Safe Least Squares Value Iteration (NCS-LSVI), which first reduces uncertainty about the safe set by playing a known safe policy. After that, it carefully balances exploration and exploitation to achieve the regret bound. Finally, numerical simulations on an autonomous driving scenario demonstrate the effectiveness of NCS-LSVI.
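To make the instantaneous-constraint mechanism concrete, below is a minimal, hypothetical sketch, not the paper's implementation, of how an action could be certified safe from data in a linear feature space: a ridge estimate of the constraint cost plus an elliptical confidence width, with the action accepted only if the resulting conservative (upper-confidence) cost estimate stays below the threshold $\tau$. The helper names (`phi`, `lam`, `beta`) and the specific form of the bonus are illustrative assumptions.

```python
# Illustrative sketch only (not the authors' algorithm): certify actions as safe
# under an instantaneous constraint using a ridge estimate of the constraint cost
# plus an elliptical confidence width. All names and constants are assumptions.

import numpy as np

def certified_safe_actions(s, actions, phi, hist_feats, hist_costs,
                           tau, lam=1.0, beta=1.0):
    """Return the subset of `actions` whose conservative (upper-confidence)
    cost estimate is at most the safety threshold tau.

    phi(s, a)   : feature map, returns a d-dimensional vector
    hist_feats  : (n, d) array of features of previously played (s, a) pairs
    hist_costs  : (n,) array of observed instantaneous constraint costs
    """
    d = hist_feats.shape[1]
    # Regularized least-squares (ridge) estimate of the cost parameter.
    Lambda = lam * np.eye(d) + hist_feats.T @ hist_feats
    theta_hat = np.linalg.solve(Lambda, hist_feats.T @ hist_costs)

    safe = []
    for a in actions:
        x = phi(s, a)
        mean = x @ theta_hat
        # Elliptical confidence width ||x||_{Lambda^{-1}}, scaled by beta.
        width = beta * np.sqrt(x @ np.linalg.solve(Lambda, x))
        if mean + width <= tau:  # conservative estimate within the threshold
            safe.append(a)
    return safe
```

In the two-phase scheme described in the abstract, one would expect the initial safe-baseline phase to populate the history (`hist_feats`, `hist_costs`) so that this certified set is non-empty once the exploration phase begins; this sketch only illustrates that certification step under the assumed notation.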
Lay Summary: Self-driving cars and robots must learn from experience, yet even a single crash during training is unacceptable. We introduce a two-stage strategy that behaves like a student driver: it begins by cruising cautiously on quiet streets to map out what is safe, then, once confident, explores new actions while remaining safe. Despite this cautious start, it learns nearly as efficiently as a risk-taking learner. We prove mathematically that the risk of an accident remains near zero throughout training while achieving (nearly) optimal regret. Our analysis introduces a new mathematical tool to handle hard safety scenarios and also corrects a flaw in earlier research. In our simulated driving tests, the system completed every route without a single collision, while nearly matching the performance of unsafe approaches, validating our theoretical insights.
Primary Area: Theory->Learning Theory
Keywords: Reinforcement Learning, Episodic Linear MDP, Constrained RL, Safe RL, Non-Convex RL, Covering number
Submission Number: 7744