Keywords: LLM, VLM, Manipulation, Articulation
TL;DR: We introduce A3VLM, an articulation-aware vision language model that focuses on the articulation structure and action affordances of objects.
Abstract: Vision Language Models (VLMs) for robotics have received significant attention in recent years. Because a VLM can understand robot observations and perform complex visual reasoning, it is regarded as a potential universal solution for general robotics challenges such as manipulation and navigation. However, previous robotics VLMs such as RT-1, RT-2, and ManipLLM have focused on directly learning robot actions. Such approaches require collecting a significant amount of robot interaction data, which is extremely costly in the real world. We therefore propose A3VLM, an object-centric, actionable, articulation-aware vision language model. A3VLM focuses on the articulation structure and action affordances of objects. Its representation is robot-agnostic and can be translated into robot actions using simple action primitives. Extensive experiments on both simulation benchmarks and real-world settings demonstrate the effectiveness and stability of A3VLM.
Supplementary Material: zip
Spotlight Video: mp4
Video: https://youtu.be/Fn4bn6IHRSc?feature=shared
Code: https://github.com/changhaonan/A3VLM
Publication Agreement: pdf
Student Paper: yes
Submission Number: 229