Reconnaissance for reinforcement learning with safety constraints

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: Reinforcement Learning, Safety constraints, Constrained Markov Decision Process
Abstract: Practical reinforcement learning problems are often formulated as constrained Markov decision process (CMDP) problems, in which the agent has to maximize the expected return while satisfying a set of prescribed safety constraints. In this study, we consider a situation in which the agent has access to a generative model that provides a next-state sample for any given state-action pair, and propose a method to solve a CMDP problem by decomposing the CMDP into a pair of MDPs: the \textit{reconnaissance} MDP (R-MDP) and the \textit{planning} MDP (P-MDP). In the R-MDP, we train a threat function, the Q-function analogue of danger, which determines whether a given state-action pair is safe or not. In the P-MDP, we train a reward-seeking policy while using the fixed threat function to determine the safeness of each action. With the help of the generative model, we can efficiently train the threat function by preferentially sampling rare dangerous events. Once the threat function for a baseline policy has been computed, we can solve other CMDP problems with different rewards and different danger constraints without re-training the model. We also present an efficient approximation method for the threat function that greatly reduces the difficulty of solving the R-MDP. We demonstrate the efficacy of our method over classical approaches on benchmark tasks and complex collision-free navigation tasks.
One-sentence Summary: We propose a safe RL algorithm that conducts safety assessment and reward optimization in two separate phases, using a simulator to efficiently train a danger analogue of the Q-function.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Reviewed Version (pdf): https://openreview.net/references/pdf?id=6dUGwrzCq
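To make the two-phase R-MDP/P-MDP decomposition described in the abstract more concrete, below is a minimal, self-contained Python sketch of the general idea in a toy tabular setting. It is not the authors' implementation: the toy environment, the `generative_model`, `baseline_policy`, and `safe_actions` helpers, the threshold value, and the uniform (s, a) sampling are all hypothetical simplifications, and the paper's preferential sampling of rare dangerous events and its threat-function approximation are omitted.

```python
import numpy as np

N_STATES, N_ACTIONS = 6, 2
DANGER_STATE = 5                      # entering this state violates the safety constraint
GAMMA, THREAT_THRESHOLD = 0.95, 0.1   # hypothetical discount and safety threshold
rng = np.random.default_rng(0)

def generative_model(s, a):
    """Hypothetical generative model: returns one next-state sample for (s, a)."""
    if a == 1 and rng.random() < 0.2:  # the risky action occasionally jumps to danger
        return DANGER_STATE
    return min(s + a, N_STATES - 1) if rng.random() < 0.9 else max(s - 1, 0)

def baseline_policy(s):
    """Fixed, conservative baseline policy whose threat we estimate."""
    return 0

# Reconnaissance phase (R-MDP): estimate threat[s, a], a Q-function-like
# measure of how likely (s, a) is to eventually reach the danger state
# when the baseline policy is followed afterwards.
threat = np.zeros((N_STATES, N_ACTIONS))
for _ in range(20000):
    s = int(rng.integers(N_STATES))           # query the generative model freely
    a = int(rng.integers(N_ACTIONS))
    s_next = generative_model(s, a)
    danger = 1.0 if s_next == DANGER_STATE else 0.0
    target = danger + GAMMA * (1.0 - danger) * threat[s_next, baseline_policy(s_next)]
    threat[s, a] += 0.1 * (target - threat[s, a])

# Planning phase (P-MDP): a reward-seeking learner would only choose among
# actions whose estimated threat stays below the prescribed threshold.
def safe_actions(s):
    allowed = [a for a in range(N_ACTIONS) if threat[s, a] <= THREAT_THRESHOLD]
    return allowed or [int(np.argmin(threat[s]))]  # fall back to the least-threatening action

print("estimated threat table:\n", np.round(threat, 3))
print("safe actions in state 0:", safe_actions(0))
```

In this sketch, the threat table is computed once for the baseline policy and can then gate actions for any reward function plugged into the planning phase, mirroring the abstract's claim that one threat function can be reused across CMDP problems with different rewards and danger constraints.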