Abstract: For imitation learning algorithms to scale to real-world challenges, they must handle high-dimensional observations, offline learning, and policy-induced covariate shift. We propose DITTO, an offline imitation learning algorithm that addresses all three of these problems. DITTO optimizes a novel distance metric in the latent space of a learned world model: first, we train a world model on all available trajectory data; then, the imitation agent is unrolled from expert start states in the learned model and penalized for its latent divergence from the expert dataset over multiple time steps. We optimize this multi-step latent divergence using standard reinforcement learning algorithms, which provably induces imitation learning, and empirically achieves state-of-the-art performance and sample efficiency on a range of Atari environments from pixels, without any online environment access. We also adapt other standard imitation learning algorithms to the world-model setting and show that this considerably improves their performance. Our results show how creative use of world models can lead to a simple, robust, and highly performant policy-learning framework.
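To make the imagined-rollout idea in the abstract concrete, here is a minimal illustrative sketch, not the paper's exact formulation: the names `world_model.encode`, `world_model.step`, `policy`, the per-step L2 distance, and matching each imagined step to the same time index of a single expert episode are all hypothetical simplifications.

```python
import numpy as np

def latent_divergence_reward(agent_latent, expert_latent):
    """Illustrative per-step reward: negative distance between the agent's
    imagined latent state and the corresponding expert latent (hypothetical
    choice of L2 distance in the world model's latent space)."""
    return -np.linalg.norm(agent_latent - expert_latent, axis=-1)

def imagined_imitation_rollout(world_model, policy, expert_episode, horizon):
    """Unroll the imitation policy inside the learned world model from an
    expert start state and score it against the expert latent trajectory."""
    # Encode the expert observations into the world model's latent space.
    expert_latents = [world_model.encode(obs) for obs in expert_episode]
    z = expert_latents[0]                 # start from an expert state
    rewards = []
    for t in range(1, min(horizon + 1, len(expert_latents))):
        action = policy(z)                # act in imagination only, no env access
        z = world_model.step(z, action)   # imagined latent transition
        rewards.append(latent_divergence_reward(z, expert_latents[t]))
    # These rewards would then be maximized with a standard RL algorithm.
    return rewards
```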
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We added language to make it clear that our proposed baseline method D-GAIL is essentially identical to VMAIL, a related method that multiple reviewers brought up. The only difference is that D-GAIL uses the stronger world model of DreamerV2. This helps to better contextualize our work relative to a strong related approach, and clarifies that our central contribution is the latent divergence reward formulation.
We also addressed the formatting and related errors raised by the reviewers.
Assigned Action Editor: ~Oleg_Arenz1
Submission Number: 3982