Parameter-efficient Tuning of Pretrained Visual-Language Models in Multitask Robot Learning

Published: 23 Oct 2023, Last Modified: 06 Nov 2023, CoRL23-WS-LEAP Poster
Keywords: pretrained visual-language models, multitask robot learning, adapters
TL;DR: Adapting pretrained CLIP text and visual encoders in a temporal transformer yields significant performance gains in low-resource multitask robot learning.
Abstract: Multimodal pretrained visual-language models (pVLMs) have shown excellent performance across several applications, such as visual question answering. Their recent application to policy learning has opened promising avenues for augmenting robotic capabilities in the real world. This paper addresses parameter-efficient tuning of pVLMs to adapt them to robotic manipulation tasks in low-resource data regimes. We show how Low-Rank Adapters (LoRA) can be injected into behavioral-cloning temporal transformers that fuse language, multi-view images, and proprioception for multitask robot learning, including long-horizon tasks. Preliminary results indicate that our approach vastly outperforms baseline architectures and tuning methods, paving the way toward parameter-efficient adaptation of pretrained large multimodal transformers for robot learning from only a handful of demonstrations.
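To illustrate the general idea of injecting Low-Rank Adapters into a frozen pretrained encoder, here is a minimal PyTorch sketch. It is not the authors' implementation; the module names, ranks, and the toy encoder block are illustrative assumptions, standing in for the linear projections of the CLIP text and visual towers inside the behavioral-cloning temporal transformer.

```python
# Minimal LoRA-injection sketch (illustrative, not the paper's code).
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update: W x + (alpha/r) * B(A x)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep pretrained weights frozen
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.01)
        nn.init.zeros_(self.lora_b.weight)    # zero update at start, so behavior matches the base model
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def inject_lora(module: nn.Module, r: int = 8, alpha: float = 16.0) -> None:
    """Recursively replace every nn.Linear in `module` with a LoRA-wrapped copy."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, r=r, alpha=alpha))
        else:
            inject_lora(child, r=r, alpha=alpha)


if __name__ == "__main__":
    # Toy stand-in for a pretrained encoder block (hypothetical dimensions).
    encoder_block = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))
    inject_lora(encoder_block, r=8)
    trainable = sum(p.numel() for p in encoder_block.parameters() if p.requires_grad)
    total = sum(p.numel() for p in encoder_block.parameters())
    print(f"trainable params: {trainable} / {total}")
```

Only the low-rank adapter weights receive gradients, which is what makes this form of tuning parameter-efficient when adapting large pretrained encoders with only a handful of demonstrations.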
Submission Number: 25