Reparameterized Variational Divergence Minimization for Stable Imitation

Anonymous

Sep 25, 2019 Blind Submission readers: everyone
  • TL;DR: The overall goal of this work is to enable sample-efficient imitation from expert demonstrations, both with and without the provision of expert action labels, through the use of f-divergences.
  • Abstract: State-of-the-art results in imitation learning are currently held by adversarial methods that iteratively estimate the divergence between student and expert policies and then minimize this divergence to bring the imitation policy closer to expert behavior. Analogous techniques for imitation learning from observations alone (without expert action labels), however, have not enjoyed the same widespread success. Recent work in adversarial methods for generative models has shown that the measure used to judge the discrepancy between real and synthetic samples is an algorithmic design choice, and that different choices can result in significant differences in model performance. Choices including the Wasserstein distance and various $f$-divergences have already been explored in the adversarial networks literature, and the latter class has more recently been investigated for imitation learning. Unfortunately, we find that in practice this existing imitation-learning framework for using $f$-divergences suffers from numerical instabilities stemming from the combination of function approximation and policy-gradient reinforcement learning. In this work, we alleviate these challenges and offer a reparameterization of adversarial imitation learning as $f$-divergence minimization, before further extending the framework to handle the problem of imitation from observations only. Empirically, we demonstrate that our design choices for coupling imitation learning and $f$-divergences are critical to recovering successful imitation policies. Moreover, we find that with the appropriate choice of $f$-divergence, we can obtain imitation-from-observation algorithms that outperform baseline approaches and more closely match expert performance in continuous-control tasks with low-dimensional observation spaces. With high-dimensional observations, we still observe a significant performance gap between imitation with and without action labels, offering an interesting avenue for future work.
  • Keywords: Imitation Learning, Reinforcement Learning, Adversarial Learning, Learning from Demonstration
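
For readers unfamiliar with the variational view of $f$-divergences that the abstract refers to, the following is a minimal sketch and not the paper's method. It assumes two small discrete distributions (the hypothetical `expert` and `student` lists), takes the KL divergence as the example with $f(u) = u \log u$ and convex conjugate $f^*(t) = e^{t-1}$, and checks the standard lower bound $D_f(P \| Q) \ge \mathbb{E}_P[T(x)] - \mathbb{E}_Q[f^*(T(x))]$. The direction of the divergence and all function names are illustrative assumptions.

```python
import math


def f_divergence(p, q, f):
    """Exact D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)) for discrete P, Q with full support."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def variational_bound(p, q, critic, f_conjugate):
    """Lower bound E_P[T(x)] - E_Q[f*(T(x))], with the critic T given as one value per outcome."""
    expect_p = sum(pi * t for pi, t in zip(p, critic))
    expect_q = sum(qi * f_conjugate(t) for qi, t in zip(q, critic))
    return expect_p - expect_q


def kl_f(u):
    """Generator of the KL divergence: f(u) = u log u."""
    return u * math.log(u)


def kl_conjugate(t):
    """Convex conjugate of the KL generator: f*(t) = exp(t - 1)."""
    return math.exp(t - 1.0)


# Hypothetical toy distributions over three outcomes (assumed for illustration only).
expert = [0.5, 0.3, 0.2]
student = [0.2, 0.3, 0.5]

exact = f_divergence(expert, student, kl_f)

# The bound is tight at the optimal critic T*(x) = f'(p(x) / q(x)) = 1 + log(p(x) / q(x)).
optimal_critic = [1.0 + math.log(pi / qi) for pi, qi in zip(expert, student)]
tight = variational_bound(expert, student, optimal_critic, kl_conjugate)

# Any other critic, here a constant one, gives a value no larger than the exact divergence.
loose = variational_bound(expert, student, [1.0, 1.0, 1.0], kl_conjugate)

print(f"exact KL: {exact:.6f}, optimal-critic bound: {tight:.6f}, constant-critic bound: {loose:.6f}")
```

In adversarial imitation learning the critic is a learned function (a discriminator) rather than a lookup table, so the bound is only approximately maximized; the abstract attributes the numerical instabilities it describes to the combination of such function approximation with policy-gradient reinforcement learning.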