Keypoint Action Tokens Enable In-Context Imitation Learning in Robotics

Published: 05 Apr 2024 · Last Modified: 28 Apr 2024 · VLMNM 2024 · CC BY 4.0
Keywords: Imitation Learning, In-Context Learning, Large Pretrained Models, Foundation Models
TL;DR: We repurpose text-pretrained Transformers as sequence-to-sequence imitation learning machines, mapping visual inputs to action outputs via our proposed Keypoint Action Tokens framework.
Abstract: We show that off-the-shelf text-based Transformers, with no additional training, can perform few-shot in-context visual imitation learning, mapping visual observations to action sequences that emulate the demonstrator's behaviour. To do so, we transform visual observations (inputs) and trajectories of actions (outputs) into sequences of tokens that a text-pretrained Transformer can ingest and generate, via a framework we call Keypoint Action Tokens (KAT). Despite being trained on language, these models excel at translating tokenised visual keypoint observations into action trajectories, performing on par with or better than state-of-the-art techniques in the low-data regime. Rather than operating in the language domain, KAT leverages text-based Transformers to operate in vision and action domains for efficient general imitation learning, indicating promising new avenues for repurposing natural language models for embodied tasks.
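The sketch below illustrates the general idea described in the abstract, not the authors' implementation: visual keypoints and action trajectories are serialised as numeric text, demonstrations are concatenated into an in-context prompt, and a text-pretrained Transformer completes the action sequence for a new observation. The function names (`tokenise`, `build_prompt`, `predict_action`) and the `llm_complete` callable are hypothetical placeholders for whatever text-generation interface is available.

```python
# Minimal sketch of the Keypoint Action Tokens idea (assumed structure, not the paper's code).
from typing import Callable, Sequence
import numpy as np

def tokenise(vec: Sequence[float], precision: int = 3) -> str:
    """Serialise a flat numeric vector (keypoints or actions) as plain text tokens."""
    return " ".join(f"{v:.{precision}f}" for v in vec)

def build_prompt(demos: list[tuple[np.ndarray, np.ndarray]],
                 test_keypoints: np.ndarray) -> str:
    """Format demonstrations as in-context (observation -> action) pairs,
    followed by the test observation whose action the model should complete."""
    lines = []
    for obs, act in demos:
        lines.append(f"Observation: {tokenise(obs.ravel())}")
        lines.append(f"Action: {tokenise(act.ravel())}")
    lines.append(f"Observation: {tokenise(test_keypoints.ravel())}")
    lines.append("Action:")
    return "\n".join(lines)

def predict_action(llm_complete: Callable[[str], str],
                   demos: list[tuple[np.ndarray, np.ndarray]],
                   test_keypoints: np.ndarray,
                   action_dim: int) -> np.ndarray:
    """Query a text-pretrained Transformer (via a user-supplied `llm_complete`
    callable) and parse its numeric completion back into an action vector."""
    completion = llm_complete(build_prompt(demos, test_keypoints))
    values = [float(tok) for tok in completion.split()[:action_dim]]
    return np.array(values)
```

No gradient updates are involved: the few demonstrations act purely as in-context examples in the prompt, which is what allows an off-the-shelf language model to be repurposed for the observation-to-action mapping.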
Supplementary Material: zip
Submission Number: 29