What Matters in Employing Vision Language Models for Tokenizing Actions in Robot Control?

Published: 05 Apr 2024, Last Modified: 29 Apr 2024, VLMNM 2024, License: CC BY 4.0
Keywords: VLM for robot control, VLM framework, low-level control
TL;DR: We present a new codebase tailored for VLMs for language-conditioned low-level robot control.
Abstract: Vision Language Models (VLMs) have demonstrated remarkable proficiency in comprehending images and text and in generating textual outputs from such inputs, owing to their training on web-scale datasets. Their potential for robotics applications is particularly intriguing. One notable example is RT-2, a system that generates low-level actions represented in textual format from a given instruction together with a sequence of historical actions and image observations. To stimulate further research in this domain, we introduce an open-source implementation tailored to using VLMs for instruction-based robot control. This implementation supports a variety of VLM architectures and facilitates straightforward integration of new models. We use our framework to train multiple VLMs and evaluate them on a physical robot. The results validate the practical efficacy of our framework, paving the way for a better understanding of and improved capabilities in instruction-based robot control systems. The code is available at: https://github.com/Nicolinho/RoboVLM.
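For readers unfamiliar with the action-as-text formulation referenced above, the sketch below illustrates one common RT-2-style scheme: continuous robot actions are clipped to a fixed range, discretized into a small number of bins, and rendered as plain text so a VLM can emit them as ordinary tokens. The bin count, action range, action dimensionality, and function names here are illustrative assumptions, not the exact configuration used in the RoboVLM codebase.

```python
import numpy as np

# Illustrative RT-2-style action tokenizer (assumed scheme, not the paper's
# exact configuration): continuous actions -> discrete bins -> text tokens.
NUM_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assumed normalized action range
ACTION_DIM = 7                        # e.g. 6-DoF end-effector delta + gripper

def encode_action(action: np.ndarray) -> str:
    """Map a continuous action vector to a space-separated string of bin ids."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    # Scale to [0, NUM_BINS - 1] and round to the nearest integer bin.
    bins = np.round(
        (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) * (NUM_BINS - 1)
    ).astype(int)
    return " ".join(str(b) for b in bins)

def decode_action(text: str) -> np.ndarray:
    """Invert the encoding: parse bin ids and map them back to real values."""
    bins = np.array([int(tok) for tok in text.split()[:ACTION_DIM]], dtype=float)
    return bins / (NUM_BINS - 1) * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

if __name__ == "__main__":
    a = np.array([0.1, -0.25, 0.0, 0.5, -0.5, 0.9, 1.0])
    s = encode_action(a)
    print(s)                  # space-separated bin ids the VLM would output
    print(decode_action(s))   # close to the original action, up to bin width
```

With such an encoding, the VLM is fine-tuned to predict these token strings conditioned on the instruction, image observations, and action history; at deployment, the generated string is decoded back into a continuous command for the robot.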
Submission Number: 43