- TL;DR: preprint for fast result
- Abstract: Learning to imitate expert behavior from demonstrations is a challenging problem, especially in environments with high-dimensional, continuous observations and unknown dynamics. The simplest approach is behavioral cloning (BC), but it suffers from distribution shift: because the agent greedily imitates demonstrated actions, accumulated errors can cause it to drift away from the demonstrated states. Recent methods based on reinforcement learning (RL), such as generative adversarial imitation learning (GAIL) and its variants, overcome this issue by training an RL agent to match the demonstrations over a long horizon. However, they all require a brittle adversarial training process with unstable rewards. Other work augments the RL process by building a dedicated generative model of the expert demonstrations, which significantly increases model and implementation complexity. In this paper, we propose to train the policy as a classifier over the states in the expert dataset and to mitigate distribution shift through RL with fixed rewards. We compute these fixed rewards from an energy-based model (EBM) hidden in the policy, and we train this EBM with contrastive divergence, further regularized by contrastive representation learning. Unlike adversarial methods, our approach uses fixed rewards that are obtained in a simple manner, and it needs no extra models for distribution estimation or reward modeling, which significantly reduces model and implementation complexity. Experiments on various Atari games show improved performance over many previous methods.
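
The abstract does not state the exact form of the energy or of the fixed reward. As a rough illustration only, the sketch below assumes the logsumexp construction that is commonly used when a classifier is reinterpreted as an EBM: the policy's action logits give a softmax policy, and the same logits give a state energy E(s) = -logsumexp_a f(s, a). The function names and the choice of negative energy as the reward are hypothetical and may differ from what the paper actually does.

```python
import math


def logsumexp(logits):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def policy_probs(logits):
    """Softmax over action logits: the policy viewed as a classifier."""
    lse = logsumexp(logits)
    return [math.exp(x - lse) for x in logits]


def state_energy(logits):
    """Assumed state energy: E(s) = -logsumexp over the action logits."""
    return -logsumexp(logits)


def fixed_reward(logits):
    """Hypothetical fixed reward: negative energy of the state.

    Under this assumption, states that the EBM considers expert-like
    (low energy) receive a higher reward. The reward is computed from
    the already-trained policy and is not updated adversarially.
    """
    return -state_energy(logits)


# Example: logits for a state with three discrete actions (made-up values).
example_logits = [2.0, 0.5, -1.0]
probs = policy_probs(example_logits)
reward = fixed_reward(example_logits)
```

In this reading, no separate reward network or density model is needed, because both the action distribution and the reward come from the same logits.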