Coherent Soft Imitation Learning

Joe Watson; Sandy Huang; Nicolas Heess

Published: 20 Jul 2023, Last Modified: 08 Jun 2025, EWRL16, Readers: Everyone

Keywords: imitation learning, behavioural cloning, inverse reinforcement learning

TL;DR: do BC, get IRL for free by inverting soft policy iteration and deriving a shaped reward based on the BC policy

Abstract: Imitation learning methods seek to learn from an expert either through behavioural cloning (BC) of the policy or inverse reinforcement learning (IRL) of the reward. Such methods enable agents to learn complex tasks from humans that are difficult to capture with hand-designed reward functions. Choosing BC or IRL for imitation depends on the quality and state-action coverage of the demonstrations, as well as additional access to the Markov decision process. Hybrid strategies that combine BC and IRL are not common, as initial policy optimization against inaccurate rewards diminishes the benefit of pretraining the policy with BC. This work derives an imitation method that captures the strengths of both BC and IRL. In the entropy-regularized ('soft') reinforcement learning setting, we show that the behaviour-cloned policy can be used as both a shaped reward and a critic hypothesis space by inverting the regularized policy update. This coherency facilitates fine-tuning cloned policies using the reward estimate and additional interactions with the environment. This approach conveniently achieves imitation learning through initial behavioural cloning, followed by refinement via RL with online or offline data sources. The simplicity of the approach enables graceful scaling to high-dimensional and vision-based tasks, with stable learning and minimal hyperparameter tuning, in contrast to adversarial approaches.
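The idea of "inverting the regularized policy update" can be illustrated with a small tabular sketch. It is not the authors' implementation: it assumes one standard form of the regularized update for a single state with discrete actions, pi(a) ∝ prior(a) · exp(Q(a) / alpha), and the names `soft_policy_update`, `inverted_reward`, `prior`, and `alpha` are hypothetical, chosen here for illustration. Under that assumption, alpha · (log pi(a) − log prior(a)) recovers Q(a) minus the soft value, which is the sense in which a cloned policy could double as a shaped reward.

```python
import math

def soft_policy_update(q_values, prior, alpha):
    """Assumed regularized improvement step: pi(a) ∝ prior(a) * exp(Q(a) / alpha)."""
    weights = [p * math.exp(q / alpha) for q, p in zip(q_values, prior)]
    normaliser = sum(weights)
    policy = [w / normaliser for w in weights]
    # Soft value of the state: alpha * log sum_a prior(a) * exp(Q(a) / alpha).
    soft_value = alpha * math.log(normaliser)
    return policy, soft_value

def inverted_reward(policy, prior, alpha):
    """Invert the update above: alpha * (log pi(a) - log prior(a))."""
    return [alpha * (math.log(pi) - math.log(p)) for pi, p in zip(policy, prior)]

# Illustrative numbers only (not taken from the paper).
q_values = [1.0, 0.5, -2.0]
prior = [1.0 / 3.0] * 3   # uniform prior policy (assumption)
alpha = 0.5               # regularization temperature (assumption)

policy, soft_value = soft_policy_update(q_values, prior, alpha)
reward = inverted_reward(policy, prior, alpha)
advantage = [q - soft_value for q in q_values]

# The inverted log-ratio matches Q(a) - V up to floating-point error.
max_gap = max(abs(r - a) for r, a in zip(reward, advantage))
print(max_gap < 1e-9)
```

In this toy setting the log-ratio equals the soft advantage exactly, so no separate reward model is needed once the policy is known. How the paper extends this to continuous actions and function approximation is not covered by the sketch.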

Community Implementations: [6 code implementations](https://www.catalyzex.com/paper/coherent-soft-imitation-learning/code)
