- Keywords: Learning from Observation, Imitation Learning, Imitation Learning from Observation Alone, Intrinsic Reward, Model-based Learning
- TL;DR: This paper presents a model-based framework for Imitation Learning from Observation Alone. The algorithm balances exploration and imitation using two self-supervised signals: a learned transition dynamics model and an intrinsic reward for exploration.
- Abstract: This paper studies Imitation Learning from Observation Alone (ILFO), where the learner is presented with expert demonstrations that consist only of the states encountered by an expert (without access to the actions taken by the expert). This paper presents a provably efficient model-based framework, MobILE, to solve the ILFO problem. MobILE uses self-supervision to (a) train a dynamics model and (b) design an intrinsic reward signal for exploration. Using these ideas, MobILE carefully trades off exploration against imitation by integrating the idea of optimism in the face of uncertainty into the distribution-matching imitation learning (IL) framework. We provide a unified analysis for MobILE, and demonstrate that MobILE enjoys strong performance guarantees for classes of MDP dynamics that satisfy certain well-studied notions of complexity. We also show that the ILFO problem is strictly harder than the standard IL problem by reducing ILFO to a multi-armed bandit problem, indicating that strategic exploration is necessary for solving ILFO efficiently. We complement these theoretical results with experimental simulations on benchmark OpenAI Gym tasks that indicate the efficacy of MobILE.
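
To make the trade-off between exploration and imitation concrete, here is a minimal sketch of one way an intrinsic reward could be derived from a learned dynamics model and combined with an imitation cost. It is an illustration under stated assumptions, not the method as specified in the paper: the bonus is computed as the disagreement within a small ensemble of linear dynamics models, and the names `ensemble_predict`, `disagreement_bonus`, `optimistic_cost`, `scale`, `cap`, and `bonus_weight` are all hypothetical.

```python
import numpy as np


def ensemble_predict(weights, state, action):
    """Predict the next state with each model of a hypothetical linear ensemble.

    Each entry of `weights` is a matrix of shape (state_dim, state_dim + action_dim).
    """
    x = np.concatenate([state, action])
    return np.stack([W @ x for W in weights])


def disagreement_bonus(weights, state, action, scale=1.0, cap=1.0):
    """Intrinsic reward: largest pairwise gap between ensemble predictions.

    Assumption: model disagreement stands in for uncertainty about the dynamics.
    The bonus is capped so it stays bounded.
    """
    preds = ensemble_predict(weights, state, action)
    diffs = preds[:, None, :] - preds[None, :, :]
    max_gap = np.linalg.norm(diffs, axis=-1).max()
    return float(min(scale * max_gap, cap))


def optimistic_cost(imitation_cost, bonus, bonus_weight=1.0):
    """Optimism in the face of uncertainty: uncertain state-action pairs look cheaper."""
    return imitation_cost - bonus_weight * bonus
```

In this sketch, `imitation_cost` would come from whatever distribution-matching objective compares learner states with expert states. Where the ensemble agrees the bonus vanishes and the learner simply imitates; where it disagrees the lowered cost encourages the learner to visit those regions and collect data that improves the model.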