Correct-by-design Safety Critics using Non-contractive Binary Bellman Operators

22 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Safe RL, Safety-critics, Reachability analysis
Abstract: The inability to naturally enforce safety in Reinforcement Learning (RL) while incurring only limited failures is a core challenge impeding its use in real-world applications. One notion of safety of vast practical relevance is the ability to avoid (unsafe) regions of the state space. Though such a safety goal can be captured by an action-value-like function, a.k.a. a safety critic, the associated operator lacks the contraction and uniqueness properties that the classical Bellman operator enjoys. In this work, we overcome the non-contractiveness of safety critic operators by leveraging the fact that safety is a binary property. To that end, we study the binary safety critic associated with a deterministic dynamical system that seeks to avoid reaching an unsafe region. We formulate the corresponding binary Bellman equation (B2E) for safety and study its properties. While the resulting operator is still non-contractive, we provide a full characterization of its fixed points: except for one spurious solution, they represent maximal persistently safe regions of the state space, from which failure can always be avoided. Interestingly, while maximality is often desirable for performance, in the context of safety it means that the learned classification boundary lies dangerously close to, and often crosses into, the region where failure is unavoidable. We thus further propose a one-sided version of the B2E that admits more robust, non-maximal fixed points. Finally, we provide an algorithm that, by design, leverages axiomatic knowledge of safe data points to avoid spurious fixed points. We provide initial empirical validation of our theory, showing that the proposed safety critic outperforms existing solutions, particularly in the number of samples (and failures) needed to learn safe policies.
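For concreteness, one plausible form of the B2E for a deterministic system with transition map $f$ and unsafe set $\mathcal{F}$ is sketched below; the notation ($f$, $\mathcal{F}$, $B$) is illustrative and not taken verbatim from the paper:

$$B(x, a) \;=\; \mathbb{1}\{f(x, a) \notin \mathcal{F}\} \cdot \max_{a'} B\big(f(x, a), a'\big), \qquad B(x, a) \in \{0, 1\}.$$

Under this form, $B(x, a) = 1$ certifies that taking action $a$ at state $x$ can be followed by a sequence of actions that avoids $\mathcal{F}$ forever, while $B \equiv 0$ is also always a fixed point; this spurious solution illustrates both why the operator cannot be a contraction and why axiomatic knowledge of safe data points is needed to rule it out.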
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6073