Multimodal Diffusion Transformer for Learning from Play

Published: 21 Oct 2023, Last Modified: 04 Nov 2023 · LangRob @ CoRL 2023 Oral
Keywords: Imitation Learning, Vision-Language-Models, Diffusion Generative Models
TL;DR: A diffusion policy framework to learn language-conditioned manipulation from play data with few language labels using pretrained foundation models.
Abstract: Diffusion models have emerged as a powerful class of generative models with widespread adoption across many areas. They have shown surprising effectiveness as a conditional policy representation in the context of robot learning. This performance has led to the popularity of various frameworks that use diffusion models to predict trajectories, action sequences, or videos. Despite their prowess, existing methodologies do not adequately address learning from multimodal goal specifications, a frequent occurrence in Learning from Play (LfP) with sparse language labels. Addressing this gap, we present the Multimodal Diffusion Transformer (MDT), a novel diffusion policy framework. MDT integrates multimodal transformers, pretrained foundation models, and latent token alignment to master long-horizon manipulation based on multimodal goal specifications. Tested on the challenging CALVIN benchmark, MDT not only sets a new performance benchmark for end-to-end policies but also does so with less than ten percent of the training time of preceding approaches. Our experiments and ablations further validate the effectiveness of MDT and the strategic choices behind its design.
Submission Number: 47