Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations

Haoran Xu; Xianyuan Zhan; Honglei Yin; Huiling Qin

Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations

Haoran Xu, Xianyuan Zhan, Honglei Yin, Huiling Qin

12 Oct 2021 (modified: 04 May 2025)Deep RL Workshop NeurIPS 2021Readers: Everyone

Keywords: imitation learning, offline imitation learning

TL;DR: We propose a new imitation learning algorithm that can learn from suboptimal demonstrations without environment interactions or expert annotations.

Abstract: We study the problem of offline Imitation Learning (IL) where an agent aims to learn an optimal expert behavior policy without additional online environment interactions. Instead, the agent is provided with a static offline dataset of state-action-next state transition triples from both optimal and non-optimal expert behaviors. This strictly offline imitation learning problem arises in many real-world problems, where environment interactions and expert annotations are costly. Prior works that address the problem either require that expert data occupies the majority proportion of the offline dataset, or need to learn a reward function and perform offline reinforcement learning (RL) based on the learned reward function. In this paper, we propose an imitation learning algorithm to address the problem without additional steps of reward learning and offline RL training for the case when demonstrations containing large-proportion of suboptimal data. Built upon behavioral cloning (BC), we introduce an additional discriminator to distinguish expert and non-expert data, we propose a cooperation strategy to boost the performance of both tasks, this will result in a new policy learning objective and surprisingly, we find its equivalence to a generalized BC objective, where the outputs of discriminator serve as the weights of the BC loss function. Experimental results show that the proposed algorithm can learn behavior policies that are much closer to the optimal policies than policies learned by baseline algorithms.

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 2 code implementations](https://www.catalyzex.com/paper/discriminator-weighted-offline-imitation/code)

0 Replies

Loading