Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling

25 Sep 2019 (modified: 11 Mar 2020) | ICLR 2020 Conference Blind Submission | Readers: Everyone
  • TL;DR: We introduce a notion of conservatively extrapolated value functions, which provably lead to policies that can self-correct to stay close to the demonstration states, and we learn them with a novel negative sampling technique.
  • Abstract: Imitation learning followed by reinforcement learning is a promising paradigm for solving complex control tasks sample-efficiently. However, learning from demonstrations often suffers from the covariate shift problem, which results in cascading errors of the learned policy. We introduce a notion of conservatively extrapolated value functions, which provably lead to policies with self-correction. We design an algorithm, Value Iteration with Negative Sampling (VINS), that learns such conservatively extrapolated value functions in practice. We show that VINS can correct mistakes of the behavioral cloning policy on simulated robotics benchmark tasks. We also propose using VINS to initialize a reinforcement learning algorithm, which is shown to outperform prior work in sample efficiency.
  • Keywords: imitation learning, model-based imitation learning, model-based RL, behavior cloning, covariate shift