State-Only Imitation Learning by Trajectory Distribution Matching

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission
Keywords: Imitation Learning, Normalising Flows, Learning from Observations, Density Models
Abstract: The best-performing state-only imitation learning approaches are based on adversarial imitation learning. Their main drawback, however, is that adversarial training is often unstable and lacks a reliable convergence estimator. When the true environment reward is unknown and cannot be used to select the best-performing model, this can result in poor real-world policy performance. We propose a non-adversarial learning-from-observations approach with an interpretable convergence and performance metric. Our training objective minimizes the Kullback-Leibler divergence between the policy and expert state-transition trajectory distributions, and it can be optimized in a non-adversarial fashion. To this end, additional density models estimate the expert state-transition distribution and the environment's forward and backward dynamics. We demonstrate the effectiveness of our approach on well-known continuous control environments, where our method can reach expert performance. We further show that our method and loss are better suited than adversarial objectives for selecting the best-performing policy, while being competitive with or outperforming the state-of-the-art learning-from-observations approach in these environments.
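
For intuition, a minimal sketch of such an objective (the notation below is ours and assumed, not taken from the paper): writing a state-only trajectory as $\tau_s = (s_0, s_1, \dots, s_T)$ and assuming both the policy-induced distribution $p_\pi$ and the expert distribution $p_E$ are Markovian with a shared initial-state distribution $p(s_0)$, the trajectory-level KL divergence decomposes into per-transition log-density ratios:

```latex
% Minimal sketch, assuming Markovian state transitions and a shared
% initial-state distribution p(s_0), so the p(s_0) terms cancel.
\begin{aligned}
D_{\mathrm{KL}}\!\bigl(p_\pi(\tau_s)\,\|\,p_E(\tau_s)\bigr)
  &= \mathbb{E}_{\tau_s \sim p_\pi}\!\left[\log \frac{p_\pi(\tau_s)}{p_E(\tau_s)}\right] \\
  &= \mathbb{E}_{\tau_s \sim p_\pi}\!\left[\sum_{t=0}^{T-1}
        \Bigl(\log p_\pi(s_{t+1}\mid s_t) - \log p_E(s_{t+1}\mid s_t)\Bigr)\right].
\end{aligned}
```

Under this reading, the expert term $\log p_E(s_{t+1}\mid s_t)$ would be supplied by a density model (e.g., a normalizing flow) fit to expert transitions, while the policy term would be obtained from the policy together with the learned forward and backward dynamics models mentioned above; because every term is an explicit log-density, the resulting loss value can double as the interpretable convergence and performance metric.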
One-sentence Summary: We propose a non-adversarial learning-from-observations approach using density models to estimate environment transition distributions from the expert and the policy, resulting in an interpretable convergence and performance metric.