DITTO: Offline Imitation Learning with World Models

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Imitation Learning, Reinforcement Learning, World Models, Offline
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: A novel imitation learning approach using RL with an intrinsic reward defined in the latent space of a learned world model.
Abstract: For imitation learning algorithms to scale to real-world challenges, they must handle high-dimensional observations, offline learning, and policy-induced covariate shift. We propose DITTO, an offline imitation learning algorithm which addresses all three of these problems. DITTO optimizes a novel distance measure defined in the latent space of a learned world model: we roll out the learned policy in the latent space of the world model and compute its divergence from expert trajectories over multiple time steps. We then minimize this divergence with on-policy reinforcement learning, using the negative divergence as an intrinsic reward. This approach has several benefits: the policy is learned under its own induced state distribution, so we can use on-policy algorithms in the offline setting; the world model provides a natural measure of learner-expert divergence, effectively acting as an oracle that teaches the learner to recover from its mistakes; and the world model lets us decouple the learning of dynamics and control into the world model and the policy, respectively. DITTO is completely offline, requiring no online interaction at all. Theoretically, we show that our formulation induces a divergence bound between expert and learner, which in turn bounds their difference in extrinsic reward. We test our method on standard imitation learning benchmarks, including difficult Atari environments from pixels alone, and achieve state-of-the-art performance in the offline setting. We also adapt standard imitation learning algorithms to the world-model setting and show that this considerably improves their performance and robustness.
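
To make the abstract's training signal concrete, below is a minimal sketch of the latent-rollout intrinsic reward it describes: unroll the policy inside the world model's latent space, then reward it for staying close to the expert's latent trajectory. The world-model interface (`step`), the policy call, the L2 distance, and the fixed horizon are illustrative assumptions, not the paper's exact definitions.

```python
# Sketch of DITTO's intrinsic reward, assuming a hypothetical world-model API.
import numpy as np

def latent_rollout(world_model, policy, z0, horizon):
    """Unroll the learned policy in the world model's latent space."""
    latents = [z0]
    for _ in range(horizon):
        action = policy(latents[-1])                    # act on current latent state
        latents.append(world_model.step(latents[-1], action))  # imagined transition
    return latents

def intrinsic_rewards(learner_latents, expert_latents):
    """Negative per-step latent distance to the expert trajectory.

    Maximizing this reward with an on-policy RL algorithm minimizes the
    learner-expert divergence over the whole horizon. L2 distance is one
    hypothetical choice of latent distance measure.
    """
    return [-np.linalg.norm(z_learner - z_expert)
            for z_learner, z_expert in zip(learner_latents, expert_latents)]
```

Because the rollouts happen entirely inside the learned model, the policy is trained under its own induced state distribution without any environment interaction, which is what allows on-policy RL in the offline setting.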
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5272