Reinforcement Learning from Imperfect Demonstrations

Yang Gao, Huazhe(Harry) Xu, Ji Lin, Fisher Yu, Sergey Levine, Trevor Darrell

Feb 15, 2018 (modified: Feb 15, 2018) ICLR 2018 Conference Blind Submission readers: everyone Show Bibtex
  • Abstract: Robust real-world learning should benefit from both demonstrations and interaction with the environment. Current approaches to learning from demonstration and reward perform supervised learning on expert demonstration data and use reinforcement learning to further improve performance based on reward from the environment. These tasks have divergent losses which are difficult to jointly optimize; further, such methods can be very sensitive to noisy demonstrations. We propose a unified reinforcement learning algorithm that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data. Our Normalized Actor-Critic (NAC) method can learn from demonstration data of arbitrary quality and also leverages rewards from an interactive environment. NAC learns an initial policy network from demonstration and refines the policy in a real environment. Crucially, both learning from demonstration and interactive refinement use exactly the same objective, unlike prior approaches that combine distinct supervised and reinforcement losses. This makes NAC robust to suboptimal demonstration data, since the method is not forced to mimic all of the examples in the dataset. We show that our unified reinforcement learning algorithm can learn robustly and outperform existing baselines when evaluated on several realistic driving games.
  • Keywords: learning from demonstration, reinforcement learning, maximum entropy learning