Parameter-efficient Tuning of Pretrained Visual-Language Models in Multitask Robot Learning

Published: 23 Oct 2023, Last Modified: 06 Nov 2023, CoRL23-WS-LEAP Poster
Keywords: pretrained visual-language models, multitask robot learning, adapters
TL;DR: Adapting pretrained CLIP text and visual encoders in a temporal transformer yields significant performance gains in low-resource multitask robot learning.
Abstract: Multimodal pretrained visual-language models (pVLMs) have shown excellent performance across several applications, such as visual question answering. Their recent application to policy learning has opened promising avenues for augmenting robotic capabilities in the real world. This paper addresses parameter-efficient tuning of pVLMs to adapt them to robotic manipulation tasks in low-resource data regimes. We show how Low-Rank Adapters (LoRA) can be injected into behavioral-cloning temporal transformers that fuse language, multi-view images, and proprioception for multitask robot learning, including long-horizon tasks. Preliminary results indicate that our approach vastly outperforms baseline architectures and tuning methods, paving the way toward parameter-efficient adaptation of pretrained large multimodal transformers for robot learning from only a handful of demonstrations.
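To illustrate the general idea of injecting Low-Rank Adapters into a frozen pretrained encoder, here is a minimal PyTorch sketch. It is not the authors' implementation; the module names, ranks, and the toy encoder block are illustrative assumptions, standing in for the linear projections of the CLIP text and visual towers inside the behavioral-cloning temporal transformer.

```python
# Minimal LoRA-injection sketch (illustrative, not the paper's code).
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update: W x + (alpha/r) * B(A x)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep pretrained weights frozen
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.01)
        nn.init.zeros_(self.lora_b.weight)    # zero update at start, so behavior matches the base model
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def inject_lora(module: nn.Module, r: int = 8, alpha: float = 16.0) -> None:
    """Recursively replace every nn.Linear in `module` with a LoRA-wrapped copy."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, r=r, alpha=alpha))
        else:
            inject_lora(child, r=r, alpha=alpha)


if __name__ == "__main__":
    # Toy stand-in for a pretrained encoder block (hypothetical dimensions).
    encoder_block = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))
    inject_lora(encoder_block, r=8)
    trainable = sum(p.numel() for p in encoder_block.parameters() if p.requires_grad)
    total = sum(p.numel() for p in encoder_block.parameters())
    print(f"trainable params: {trainable} / {total}")
```

Only the low-rank adapter weights receive gradients, which is what makes this form of tuning parameter-efficient when adapting large pretrained encoders with only a handful of demonstrations.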
Submission Number: 25