- Keywords: Safe Reinforcement Learning, Bayesian Optimization, Proximal Policy Optimization
- TL;DR: A methodology for discovering counterexamples from a trained RL policy and revising it into a safer policy that no longer exhibits those counterexamples.
- Abstract: Constructing Reinforcement Learning (RL) policies that adhere to safety requirements is an emerging field of study. RL agents learn via trial and error with the objective of maximizing a reward signal, and policies trained purely to accumulate reward often fail to satisfy safety specifications. We present a methodology for counterexample-guided refinement of a trained RL policy against a given safety specification. Our approach has two main components. The first uses Bayesian optimization over multiple parameters of uncertainty to discover failure trajectories of a policy learnt in a model-free setting. The second selectively modifies the policy at its failure points using gradient-based updates. We test the approach on several RL environments and demonstrate that such targeted changes make the policy respect the safety specification.
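- To make the first component more concrete, below is a minimal sketch of how counterexample search via Bayesian optimization over uncertainty parameters might look. It assumes a Gym-style CartPole environment, a simple heuristic standing in for the trained PPO policy, and a toy "cart position stays within ±1.0 m" safety specification; these names, the perturbation scheme, and the `rollout_safety_margin` helper are illustrative assumptions, not the interface of the linked repository.

```python
# Minimal sketch of the counterexample-search component (illustrative only).
import gymnasium as gym
import numpy as np
from skopt import gp_minimize
from skopt.space import Real

env = gym.make("CartPole-v1")

def policy(obs):
    # Stand-in for a trained PPO policy: push the cart toward the pole's lean.
    return 1 if obs[2] > 0 else 0

def rollout_safety_margin(params):
    """Roll out the policy from a perturbed initial state and return the
    minimum safety margin along the trajectory (negative => violation)."""
    env.reset(seed=0)
    # Hypothetical uncertainty parameters: offsets on the initial cart
    # position and pole angle (classic-control envs expose .state directly).
    state = np.array(env.unwrapped.state, dtype=np.float64)
    state[0] += params[0]
    state[2] += params[1]
    env.unwrapped.state = state
    obs = state.astype(np.float32)
    margin = np.inf
    for _ in range(300):
        obs, _, terminated, truncated, _ = env.step(policy(obs))
        # Example safety specification: |cart position| must stay below 1.0 m.
        margin = min(margin, 1.0 - abs(obs[0]))
        if terminated or truncated:
            break
    return float(margin)

# Bayesian optimization minimizes the margin, steering sampling toward
# uncertainty values that yield counterexample (unsafe) trajectories.
space = [Real(-0.05, 0.05, name="d_position"), Real(-0.05, 0.05, name="d_angle")]
result = gp_minimize(rollout_safety_margin, space, n_calls=40, random_state=0)

if result.fun < 0:
    print("Counterexample found; perturbation:", result.x)
else:
    print("No violation found; smallest margin:", result.fun)
```

- Because the rollout is treated as a black box, no differentiability of the environment or policy is required; the Gaussian-process surrogate simply concentrates samples near low-margin regions, which is what makes the search usable in the model-free setting described in the abstract.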
- Supplementary Material: pdf
- Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.
- Code: https://github.com/britig/policy-refinement-bo