State-Only Imitation Learning by Trajectory Distribution Matching

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission
Keywords: Imitation Learning, Normalising Flows, Learning from Observations, Density Models
Abstract: The best-performing state-only imitation learning approaches are based on adversarial imitation learning. Their main drawback, however, is that adversarial training is often unstable and lacks a reliable convergence estimator. When the true environment reward is unknown and cannot be used to select the best-performing model, this can result in poor real-world policy performance. We propose a non-adversarial learning-from-observations approach with an interpretable convergence and performance metric. Our training objective minimizes the Kullback-Leibler divergence between the policy and expert state-transition trajectory distributions, and it can be optimized in a non-adversarial fashion. To this end, additional density models estimate the expert state-transition distribution and the environment's forward and backward dynamics. We demonstrate the effectiveness of our approach on well-known continuous control environments, where our method can reach expert performance. We further show that our method and loss are better suited than adversarial objectives for selecting the best-performing policy, while being competitive with or outperforming the state-of-the-art learning-from-observations approach in these environments.
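
For intuition, a minimal sketch of such an objective (the notation below is ours and assumed, not taken from the paper): writing a state-only trajectory as $\tau_s = (s_0, s_1, \dots, s_T)$ and assuming both the policy-induced distribution $p_\pi$ and the expert distribution $p_E$ are Markovian with a shared initial-state distribution $p(s_0)$, the trajectory-level KL divergence decomposes into per-transition log-density ratios:

```latex
% Minimal sketch, assuming Markovian state transitions and a shared
% initial-state distribution p(s_0), so the p(s_0) terms cancel.
\begin{aligned}
D_{\mathrm{KL}}\!\bigl(p_\pi(\tau_s)\,\|\,p_E(\tau_s)\bigr)
  &= \mathbb{E}_{\tau_s \sim p_\pi}\!\left[\log \frac{p_\pi(\tau_s)}{p_E(\tau_s)}\right] \\
  &= \mathbb{E}_{\tau_s \sim p_\pi}\!\left[\sum_{t=0}^{T-1}
        \Bigl(\log p_\pi(s_{t+1}\mid s_t) - \log p_E(s_{t+1}\mid s_t)\Bigr)\right].
\end{aligned}
```

Under this reading, the expert term $\log p_E(s_{t+1}\mid s_t)$ would be supplied by a density model (e.g., a normalizing flow) fit to expert transitions, while the policy term would be obtained from the policy together with the learned forward and backward dynamics models mentioned above; because every term is an explicit log-density, the resulting loss value can double as the interpretable convergence and performance metric.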
One-sentence Summary: We propose a non-adversarial learning-from-observations approach using density models to estimate environment transition distributions from the expert and the policy, resulting in an interpretable convergence and performance metric.