Guiding Offline Reinforcement Learning Using Safety Expert

Richa Verma; Kartik Bharadwaj; Harshad Khadilkar; Balaraman Ravindran

05 Oct 2022 (modified: 05 May 2023), Offline RL Workshop NeurIPS 2022, Readers: Everyone

Keywords: Offline RL, Safety through transfer of knowledge in Offline RL

TL;DR: We quantify the uncertainty of a state based on how frequently it appears in the training dataset. In states with high uncertainty, the offline RL agent mimics the safety expert while maximizing the long-term reward.

Abstract: Offline reinforcement learning is used to train policies when accessing the environment during training is expensive or infeasible. An agent trained in this setting receives no corrective feedback once the learned policy starts diverging, and it may fall prey to the overestimation bias commonly seen in offline RL. This increases the chance that the agent chooses unsafe or risky actions, especially in states with sparse or no representation in the training dataset. In this paper, we propose to leverage a safety expert to discourage the offline RL agent from choosing unsafe actions in states that are under-represented in the dataset. The proposed framework transfers the safety expert's knowledge in an offline setting for states with high uncertainty, to prevent catastrophic failures in safety-critical domains. We use a simple but effective approach to quantify the uncertainty of a state based on how frequently it appears in the training dataset. In states with high uncertainty, the offline RL agent mimics the safety expert while maximizing the long-term reward. We modify TD3+BC, an existing offline RL algorithm, as part of the proposed approach. We demonstrate empirically that our approach performs better than TD3+BC on some control tasks and comparably on others across two sets of benchmark datasets, while reducing the chance of taking unsafe actions in sparse regions of the state space.
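
The idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact uncertainty measure or loss, so everything specific here is an assumption made for illustration: states are discretized on a uniform grid, uncertainty is taken as 1/sqrt(1 + count), and the TD3+BC actor objective (Q term scaled by alpha over the mean absolute Q, plus a squared-error behavior cloning term) is extended by interpolating between cloning the dataset action and cloning the safety expert's action. The function names `fit_counts`, `uncertainty`, and `guided_actor_loss` are hypothetical, and the sketch evaluates the loss with NumPy rather than training a network.

```python
import numpy as np

def bin_index(states, low, high, bins):
    """Map each continuous state to a tuple of grid-cell indices (assumed discretization)."""
    scaled = (np.asarray(states, dtype=float) - low) / (high - low)
    idx = np.clip((scaled * bins).astype(int), 0, bins - 1)
    return [tuple(int(i) for i in row) for row in idx]

def fit_counts(states, low, high, bins):
    """Count how often each grid cell appears in the training dataset."""
    counts = {}
    for key in bin_index(states, low, high, bins):
        counts[key] = counts.get(key, 0) + 1
    return counts

def uncertainty(states, counts, low, high, bins):
    """Assumed count-based uncertainty: 1 for unseen cells, decaying toward 0 with visits."""
    n = np.array([counts.get(k, 0) for k in bin_index(states, low, high, bins)])
    return 1.0 / np.sqrt(1.0 + n)

def guided_actor_loss(q, pi_a, data_a, expert_a, u, alpha=2.5):
    """TD3+BC-style actor loss with an assumed uncertainty-weighted expert term.

    q:        Q(s, pi(s)) per sample, shape (N,)
    pi_a:     policy actions, shape (N, action_dim)
    data_a:   dataset actions, shape (N, action_dim)
    expert_a: safety expert actions, shape (N, action_dim)
    u:        per-state uncertainty in [0, 1], shape (N,)
    """
    lam = alpha / (np.abs(q).mean() + 1e-8)           # TD3+BC normalization of the Q term
    bc = ((pi_a - data_a) ** 2).sum(axis=1)           # clone the dataset action
    safe = ((pi_a - expert_a) ** 2).sum(axis=1)       # clone the safety expert (assumption)
    return float(np.mean(-lam * q + (1.0 - u) * bc + u * safe))
```

Under these assumptions, a state whose grid cell never appears in the dataset gets uncertainty 1, so the policy is pulled entirely toward the safety expert's action there, while frequently visited states fall back to an objective close to that of TD3+BC.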
