- Keywords: Safe Reinforcement Learning, Human-in-the-loop, Imitation Learning
- Abstract: When learning common skills like driving, beginners usually have experienced people or domain experts standing by to ensure the safety of the learning process. We formulate such a learning scheme as Expert-in-the-loop Reinforcement Learning (ERL), in which a guardian is introduced to safeguard the exploration of the learning agent. While allowing sufficient exploration in an uncertain environment, the guardian intervenes in dangerous situations and demonstrates the correct actions to avoid potential accidents. ERL thus provides two training data sources: the agent's own exploration and the expert's partial demonstrations. In this setting, we develop a novel Expert Guided Policy Optimization (EGPO) method. The method integrates the guardian into the reinforcement learning loop; the guardian is composed of an expert policy that generates demonstrations and a switch function that decides when to intervene. In particular, a constrained optimization technique is used to prevent the trivial solution in which the agent deliberately behaves dangerously so that the expert takes over all the time. An offline RL technique is further used to learn from the partial demonstrations generated by the expert. Safe driving experiments show that our method achieves superior training- and test-time safety, outperforms baselines by a large margin in sample efficiency, and preserves its generalization capacity to unseen environments at test time.
- Supplementary Material: zip
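The guardian mechanism described in the abstract — a switch function that decides when the expert takes over, with transitions labeled by their source so that exploration data and expert demonstrations can be trained on separately — can be sketched as follows. This is a minimal illustration, not the paper's implementation; `expert_policy`, `is_dangerous`, and the intervention-count budget are hypothetical stand-ins.

```python
def expert_policy(state):
    # Hypothetical expert: always returns a known-safe action (here, 0).
    return 0

def is_dangerous(state, action):
    # Hypothetical switch function: flags action 1 as unsafe in any state.
    return action == 1

def guarded_step(state, agent_action):
    """One ERL interaction step.

    Returns (executed_action, intervened): if the agent's proposal is
    dangerous, the guardian overrides it with the expert's action and the
    transition is logged as a partial demonstration; otherwise the agent's
    own action is executed as exploration data.
    """
    if is_dangerous(state, agent_action):
        return expert_policy(state), True
    return agent_action, False

def intervention_cost(intervened):
    # A per-step cost on takeovers; constraining its episode total is one
    # way to discourage the agent from provoking the expert into always
    # taking over (the trivial solution noted in the abstract).
    return 1.0 if intervened else 0.0
```

The replay buffer then holds a mix of exploration transitions (`intervened=False`) and expert demonstrations (`intervened=True`), matching the two data sources the abstract describes.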