Grounding Multimodal Large Language Models in Actions

Published: 25 Sept 2024, Last Modified: 06 Nov 2024, NeurIPS 2024 poster, CC BY-NC 4.0
Keywords: Embodied AI, Multimodal Large Language Models, Reinforcement Learning, Imitation Learning
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how best to ground an MLLM in different embodiments and their associated action spaces, including both continuous and discrete actions. For continuous actions, a set of learned tokenizations that capture an action at various resolutions provides sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action grounding approaches across five environments, encompassing over 114 embodied tasks.
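The abstract names two grounding recipes but not their implementations. Below is a minimal, illustrative sketch of both ideas, assuming a residual-quantization-style multi-resolution tokenizer for the continuous case and a word-level vocabulary mapping for the discrete case; `ResidualActionTokenizer`, `SEMANTIC_ACTION_MAP`, and all dimensions are hypothetical names and sizes chosen for the example, not the authors' code.

```python
# Illustrative sketch (assumptions, not the paper's implementation):
# (1) continuous actions -> coarse-to-fine learned tokens via residual
#     quantization across resolution levels;
# (2) discrete actions -> semantically aligned words in the MLLM vocabulary.
import torch
import torch.nn as nn


class ResidualActionTokenizer(nn.Module):
    """Tokenizes a continuous action into a multi-resolution token sequence."""

    def __init__(self, action_dim: int = 7, codebook_size: int = 256, levels: int = 3):
        super().__init__()
        # One codebook per resolution level; each level quantizes the residual
        # left over from the previous (coarser) level.
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, action_dim) for _ in range(levels)]
        )

    def encode(self, action: torch.Tensor) -> list[torch.Tensor]:
        """Map a batch of actions (B, action_dim) to `levels` token ids each."""
        residual, tokens = action, []
        for codebook in self.codebooks:
            # Nearest codebook entry to the current residual.
            dists = torch.cdist(residual, codebook.weight)  # (B, codebook_size)
            idx = dists.argmin(dim=-1)                      # (B,)
            tokens.append(idx)
            residual = residual - codebook(idx)             # refined at next level
        return tokens

    def decode(self, tokens: list[torch.Tensor]) -> torch.Tensor:
        """Reconstruct actions by summing codebook vectors across levels."""
        return sum(cb(idx) for cb, idx in zip(self.codebooks, tokens))


# Discrete actions: align each action with existing words in the MLLM's
# native output vocabulary so the model can reuse its language prior
# (the mapping below is a made-up example).
SEMANTIC_ACTION_MAP = {0: "move forward", 1: "turn left", 2: "turn right", 3: "pick up"}
```

Adding levels trades longer token sequences for finer action precision, which is one way to read the abstract's claim that capturing an action "at various resolutions" gives sufficient modeling precision for continuous control.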
Primary Area: Generative models
Submission Number: 19642