Keywords: LLM, VLM, Manipulation, Articulation
TL;DR: We introduce A3VLM, an articulation-aware vision language model that focuses on the articulation structure and action affordances of objects.
Abstract: Vision Language Models (VLMs) for robotics have received significant attention in recent years. Because a VLM can understand robot observations and perform complex visual reasoning, it is regarded as a potential universal solution for general robotics challenges such as manipulation and navigation. However, previous robotics VLMs such as RT-1, RT-2, and ManipLLM have focused on directly learning robot actions. Such approaches require collecting a significant amount of robot interaction data, which is extremely costly in the real world. We therefore propose A3VLM, an object-centric, actionable, articulation-aware vision language model. A3VLM focuses on the articulation structure and action affordances of objects. Its representation is robot-agnostic and can be translated into robot actions using simple action primitives. Extensive experiments on both simulation benchmarks and real-world settings demonstrate the effectiveness and stability of A3VLM.
Supplementary Material: zip
Spotlight Video: mp4
Video: https://youtu.be/Fn4bn6IHRSc?feature=shared
Code: https://github.com/changhaonan/A3VLM
Publication Agreement: pdf
Student Paper: yes
Submission Number: 229