Imitation Learning via Off-Policy Distribution Matching

Ilya Kostrikov; Ofir Nachum; Jonathan Tompson

Published: 20 Dec 2019 | Last Modified: 22 Jun 2025 | ICLR 2020 Conference Blind Submission | Readers: Everyone

Keywords: reinforcement learning, deep learning, imitation learning, adversarial learning

Abstract: When performing imitation learning from expert demonstrations, distribution matching is a popular approach, in which one alternates between estimating distribution ratios and then using these ratios as rewards in a standard reinforcement learning (RL) algorithm. Traditionally, estimation of the distribution ratio requires on-policy data, which has caused previous work to either be exorbitantly data-inefficient or alter the original objective in a manner that can drastically change its optimum. In this work, we show how the original distribution ratio estimation objective may be transformed in a principled manner to yield a completely off-policy objective. In addition to the data-efficiency that this provides, we are able to show that this objective also renders the use of a separate RL optimization unnecessary. Rather, an imitation policy may be learned directly from this objective without the use of explicit rewards. We call the resulting algorithm ValueDICE and evaluate it on a suite of popular imitation learning benchmarks, finding that it can achieve state-of-the-art sample efficiency and performance.

Code: [google-research/google-research](https://github.com/google-research/google-research) + [2 community implementations](https://paperswithcode.com/paper/?openreview=Hyg-JC4FDr)

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/imitation-learning-via-off-policy/code)
