Keywords: Imitation Learning, Deep Reinforcement Learning, Optimal Transport
TL;DR: A non-myopic method for zero-shot imitation from arbitrary offline data.
Abstract: Zero-shot imitation learning algorithms hold the promise of reproducing unseen behavior from as little as a single demonstration at test time.
Existing practical approaches view the expert demonstration as a sequence of goals, enabling imitation with a high-level goal selector and a low-level goal-conditioned policy.
However, this framework can suffer from myopic behavior: the agent's immediate actions towards achieving individual goals may undermine long-term objectives.
We introduce a novel method that mitigates this issue by directly optimizing the occupancy matching objective that is intrinsic to imitation learning.
We propose to lift a goal-conditioned value function to a distance between occupancies, which are in turn approximated via a learned world model.
The resulting method can learn from offline, suboptimal data, and is capable of non-myopic, zero-shot imitation, as we demonstrate in complex, continuous benchmarks.
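To make the occupancy-matching idea concrete, here is a minimal, illustrative sketch (not the paper's implementation): a learned goal-conditioned value function is used as the ground cost of an entropic optimal-transport problem between states visited by the agent and states from a single expert demonstration. The names `sinkhorn_distance`, `occupancy_matching_cost`, and the stand-in `value_fn` are hypothetical and introduced only for illustration.

```python
# Hedged sketch: occupancy matching via entropic OT with a
# goal-conditioned value function as the ground cost.
# All function names here are illustrative, not from the paper.
import numpy as np

def sinkhorn_distance(cost, eps=0.05, n_iters=200):
    """Entropy-regularized OT cost between two uniform empirical measures."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):             # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = np.diag(u) @ K @ np.diag(v)   # approximate transport plan
    return float((plan * cost).sum())

def occupancy_matching_cost(agent_states, expert_states, value_fn):
    """Pairwise cost c(s, g) = -V(s, g): goals that are cheap to reach are 'close'."""
    return np.array([[-value_fn(s, g) for g in expert_states]
                     for s in agent_states])

# Toy usage with a stand-in value function (negative Euclidean distance to goal).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    agent = rng.normal(size=(32, 2))              # states visited by the agent
    expert = rng.normal(loc=1.0, size=(16, 2))    # states from one demonstration
    value_fn = lambda s, g: -np.linalg.norm(s - g)
    C = occupancy_matching_cost(agent, expert, value_fn)
    print("occupancy-matching OT cost:", sinkhorn_distance(C))
```

In the paper's setting, the agent states would come from rollouts of a learned world model and the value function would be trained from offline data; the sketch only shows how a goal-conditioned value can be lifted to a distance between occupancies.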
Submission Number: 43