Accelerating Safe Reinforcement Learning with Constraint-mismatched Policies

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission
Keywords: Reinforcement learning with constraints, Safe reinforcement learning
Abstract: We consider the problem of reinforcement learning when provided with (1) a baseline control policy and (2) a set of constraints that the controlled system must satisfy. The baseline policy can arise from a teacher agent, demonstration data or even a heuristic while the constraints might encode safety, fairness or other application-specific requirements. Importantly, the baseline policy may be sub-optimal for the task at hand, and is not guaranteed to satisfy the specified constraints. The key challenge therefore lies in effectively leveraging the baseline policy for faster learning, while still ensuring that the constraints are minimally violated. To reconcile these potentially competing aspects, we propose an iterative policy optimization algorithm that alternates between maximizing expected return on the task, minimizing distance to the baseline policy, and projecting the policy onto the constraint-satisfying set. We analyze the convergence of our algorithm theoretically and provide a finite-sample guarantee. In our empirical experiments on five different control tasks, our algorithm consistently outperforms several state-of-the-art methods, achieving 10 times fewer constraint violations and 40% higher reward on average.
One-sentence Summary: We propose a new algorithm that learns constraint-satisfying policies with constraint-mismatched baseline policies, and provide theoretical analysis and empirical demonstration in the context of reinforcement learning with constraints.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=Goih-0xwOJ
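
The alternation described in the abstract (improve return, stay close to the baseline policy, then project onto the constraint-satisfying set) can be pictured with a small numerical sketch. The toy quadratic return surrogate, the squared-distance pull toward the baseline, and the norm-ball projection below are all hypothetical stand-ins chosen for illustration; they are not the paper's actual update rules or constraint set.

```python
import numpy as np

# Illustrative sketch of the three-step alternation described in the abstract:
# (1) improve expected return, (2) move toward a baseline policy,
# (3) project onto an (approximate) constraint-satisfying set.
# Every function here is a hypothetical stand-in, not the paper's algorithm.

rng = np.random.default_rng(0)

theta = rng.normal(size=4)           # current policy parameters
theta_baseline = rng.normal(size=4)  # constraint-mismatched baseline policy

def reward_grad(theta):
    """Gradient of a toy concave return surrogate (stand-in for a policy-gradient estimate)."""
    optimum = np.array([1.0, -2.0, 0.5, 3.0])
    return -(theta - optimum)

def project_onto_constraints(theta, limit=2.0):
    """Toy projection: clip the parameter norm, standing in for projecting the
    policy onto the set of constraint-satisfying policies."""
    norm = np.linalg.norm(theta)
    return theta if norm <= limit else theta * (limit / norm)

lr_reward, lr_baseline = 0.1, 0.05
for step in range(200):
    # Step 1: ascend the return surrogate.
    theta = theta + lr_reward * reward_grad(theta)
    # Step 2: move toward the baseline policy (gradient of a squared distance).
    theta = theta - lr_baseline * (theta - theta_baseline)
    # Step 3: project onto the constraint set so violations stay small.
    theta = project_onto_constraints(theta)

print("final parameters:", theta)
```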