Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning
Keywords: Safe reinforcement learning, inverse constrained reinforcement learning, adversarial attack, vulnerability analysis
TL;DR: We explore vulnerabilities of Safe RL applications under a minimal-knowledge assumption via inverse constrained RL.
Abstract: Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to the adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the victim policy's gradients, which are rarely available in practice. To address these challenges, we propose a vulnerability analysis framework for Safe RL policies via inverse constrained reinforcement learning (ICRL). Our approach requires only a set of expert demonstrations to learn both the safety constraints and a learner policy, which are then used to generate adversarial attacks capable of inducing safety violations in Safe RL policies. A theoretical analysis establishes the feasibility of our attack method and provides bounds for it. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach.
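The pipeline the abstract describes (learn a constraint model and a surrogate learner policy from expert demonstrations, then attack the victim through the surrogate) can be illustrated with a minimal PyTorch sketch. Everything below is an assumption-laden illustration rather than the paper's implementation: `ConstraintNet`, `icrl_constraint_loss`, and `craft_attack` are hypothetical names, the binary-classification surrogate stands in for the paper's actual ICRL objective, and the PGD-style observation attack assumes a continuous observation space and a differentiable surrogate policy.

```python
# Hypothetical sketch of the ICRL-based attack pipeline described above;
# not the paper's implementation. Assumes continuous observations and a
# differentiable surrogate (learner) policy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstraintNet(nn.Module):
    """Learned constraint/cost model c_theta(s, a) in (0, 1); values near 1
    mark state-action pairs the model considers unsafe."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def icrl_constraint_loss(cost_net, expert_obs, expert_act, learner_obs, learner_act):
    """Simplified ICRL-style surrogate objective: expert demonstrations are
    treated as feasible (target cost 0) and off-distribution learner rollouts
    as infeasible (target cost 1). The paper's actual objective may differ."""
    c_expert = cost_net(expert_obs, expert_act)
    c_learner = cost_net(learner_obs, learner_act)
    return (F.binary_cross_entropy(c_expert, torch.zeros_like(c_expert)) +
            F.binary_cross_entropy(c_learner, torch.ones_like(c_learner)))

def craft_attack(cost_net, surrogate_policy, obs, eps=0.05, steps=10, lr=0.01):
    """Black-box transfer attack: projected gradient ascent on the learned
    cost through the surrogate policy. The victim Safe RL policy is never
    queried for gradients (the minimal-knowledge assumption)."""
    delta = torch.zeros_like(obs, requires_grad=True)
    for _ in range(steps):
        act = surrogate_policy(obs + delta)          # surrogate stands in for victim
        cost_net(obs + delta, act).sum().backward()  # push toward unsafe regions
        with torch.no_grad():
            delta += lr * delta.grad.sign()          # FGSM-style ascent step
            delta.clamp_(-eps, eps)                  # project onto L-inf ball
            delta.grad.zero_()
    return (obs + delta).detach()                    # perturbed observation for the victim
```

At deployment time, the perturbed observation would be fed to the victim Safe RL policy; the attack succeeds if the induced actions violate the true (unobserved) safety constraints.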
Primary Area: reinforcement learning
Submission Number: 11985