Counterexample Guided RL Policy Refinement Using Bayesian Optimization

May 21, 2021 (edited Jan 21, 2022) · NeurIPS 2021 Poster
  • Keywords: Safe Reinforcement Learning, Bayesian Optimization, Proximal Policy Optimization
  • TL;DR: A methodology to discover counterexamples for a trained RL policy and refine it into a safer policy that no longer exhibits them.
  • Abstract: Constructing Reinforcement Learning (RL) policies that adhere to safety requirements is an emerging field of study. RL agents learn via trial and error with the objective of optimizing a reward signal. Often, policies that are designed to accumulate rewards do not satisfy safety specifications. We present a methodology for counterexample-guided refinement of a trained RL policy against a given safety specification. Our approach has two main components. The first component is an approach to discover failure trajectories using Bayesian optimization over multiple parameters of uncertainty from a policy learned in a model-free setting. The second component selectively modifies the failure points of the policy using gradient-based updates. The approach has been tested on several RL environments, and we demonstrate that the policy can be made to respect the safety specifications through such targeted changes.
  • Supplementary Material: pdf
  • Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.
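The first component described in the abstract, searching for failure trajectories via Bayesian optimization over uncertainty parameters, can be sketched as follows. This is an illustrative toy, not the authors' implementation: the 1-D uncertainty parameter, the stand-in `rollout_robustness` function (negative values denote a specification violation), the RBF kernel length-scale, and the lower-confidence-bound acquisition rule are all assumptions made for the example.

```python
# Hedged sketch: Bayesian optimization to find an uncertainty setting
# under which a fixed trained policy violates its safety specification.
import numpy as np

def rollout_robustness(theta):
    # Stand-in (assumed) for rolling out the trained policy under
    # uncertainty parameter theta and measuring specification robustness;
    # negative means a counterexample. Toy failure region near theta = 0.7.
    return 0.5 - 2.0 * np.exp(-80.0 * (theta - 0.7) ** 2)

def rbf(a, b, ls=0.1):
    # Squared-exponential kernel between two 1-D point sets.
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / ls) ** 2)

def bo_search(n_init=4, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, n_init)           # evaluated uncertainty params
    y = np.array([rollout_robustness(x) for x in X])
    cand = np.linspace(0.0, 1.0, 201)           # candidate grid
    for _ in range(n_iters):
        # Gaussian-process posterior mean/variance at the candidates.
        K = rbf(X, X) + 1e-6 * np.eye(len(X))
        Ks = rbf(cand, X)
        mu = Ks @ np.linalg.solve(K, y)
        var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
        sigma = np.sqrt(np.maximum(var, 1e-12))
        # Lower confidence bound: we *minimize* robustness to find failures.
        x_next = cand[np.argmin(mu - 2.0 * sigma)]
        X = np.append(X, x_next)
        y = np.append(y, rollout_robustness(x_next))
    i = np.argmin(y)
    return X[i], y[i]   # most violating parameter found, and its robustness

theta_star, rob = bo_search()
print(theta_star, rob)  # rob < 0: a counterexample was discovered
```

In the paper's setting, the counterexamples found this way would then feed the second component, which selectively repairs the policy at those failure points with gradient-based updates.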