GROOT-1.5: Learning to Follow Multi-Modal Instructions from Weak Supervision

Published: 18 Jun 2024, Last Modified: 05 Sept 2024 · MFM-EAI@ICML2024 Poster · CC BY 4.0
Keywords: Imitation Learning, Instruction-Following Policy
Abstract: This paper studies the problem of learning an agent policy that can follow various forms of instructions. Specifically, we focus on multi-modal instructions: the policy is expected to accomplish tasks specified as 1) a reference video, a.k.a. a one-shot demonstration; 2) a textual instruction; or 3) an expected return. Canonical goal-conditioned imitation learning pipelines require strong supervision (labeled data) in the form of $\langle \tau, c\rangle$ pairs ($\tau$ denotes a trajectory $(s_1, a_1, \dots)$ and $c$ denotes an instruction) for all modalities, which can be hard to obtain. To this end, we propose GROOT-1.5, which learns from mostly unlabeled trajectories $\tau$ plus a relatively small amount of strongly supervised data $\langle \tau, c\rangle$. The key idea is a novel algorithm that learns a shared intention space from the trajectories $\tau$ themselves together with the labels $c$, i.e., semi-supervised learning. We evaluate GROOT-1.5 on benchmarks including open-world Minecraft, Atari games, and robotic manipulation, where it demonstrates strong steerability and task performance.
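To make the training recipe described in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch of one way such a setup could look: trajectories and (pre-featurized) instructions of any modality are mapped into a shared intention space, a goal-conditioned policy is trained by behavior cloning against hindsight trajectory intentions on unlabeled data, and the small labeled subset aligns instruction embeddings with trajectory embeddings. All module names, dimensions, and loss terms here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a shared intention space trained semi-supervised;
# names, dimensions, and losses are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 128  # assumed intention-space dimensionality


class TrajectoryEncoder(nn.Module):
    """Maps a trajectory tau = (s_1, a_1, ...) to an intention vector."""
    def __init__(self, state_dim, action_dim, d=D):
        super().__init__()
        self.rnn = nn.GRU(state_dim + action_dim, d, batch_first=True)

    def forward(self, states, actions):  # (B, T, state_dim), (B, T, action_dim)
        _, h = self.rnn(torch.cat([states, actions], dim=-1))
        return h[-1]  # (B, D)


class InstructionEncoder(nn.Module):
    """Maps a pre-featurized instruction c (video, text, or return) into the same space."""
    def __init__(self, c_dim, d=D):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(c_dim, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, c):
        return self.net(c)


class Policy(nn.Module):
    """Goal-conditioned policy pi(a | s, z), where z lives in the intention space."""
    def __init__(self, state_dim, action_dim, d=D):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + d, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim))

    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))


def semi_supervised_loss(traj_enc, instr_enc, policy, batch):
    """Unlabeled trajectories train the policy via hindsight self-imitation;
    the small labeled subset aligns instruction and trajectory embeddings."""
    s, a = batch["states"], batch["actions"]  # (B, T, state_dim), (B, T, action_dim)
    z_traj = traj_enc(s, a)

    # Behavior cloning against the trajectory's own intention (no labels needed).
    pred_a = policy(s[:, 0], z_traj)
    bc_loss = F.mse_loss(pred_a, a[:, 0])

    # Alignment term is only available for the strongly supervised <tau, c> pairs.
    align_loss = torch.tensor(0.0)
    if "instruction" in batch:
        z_instr = instr_enc(batch["instruction"])
        align_loss = F.mse_loss(z_instr, z_traj.detach())

    return bc_loss + align_loss
```

One plausible design choice shown above is detaching the trajectory embedding in the alignment term, so that the labeled pairs pull instruction embeddings toward the intention space learned from trajectories rather than collapsing the trajectory encoder; whether GROOT-1.5 does this is an assumption here, not a claim from the abstract.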
Submission Number: 25