Open6DOR: Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach

Published: 22 Apr 2024, Last Modified: 29 Apr 2024
Venue: VLADR 2024 Oral
License: CC BY 4.0
Keywords: Robotic Manipulation; Rearrangement; Vision Language Model; 3D Perception
Abstract: Integrating large-scale Vision-Language Models (VLMs) with embodied AI can greatly improve robots' generalizability and their capacity to follow open instructions. However, existing studies on object rearrangement do not fully consider 6-DoF requirements, let alone establish a comprehensive benchmark. In this paper, we present a pioneering benchmark and approach for table-top Open-instruction 6-DoF Object Rearrangement (Open6DOR). Specifically, we collect a synthetic dataset of 200+ objects and carefully design 2400+ Open6DOR tasks. These tasks are divided into a Position-track, a Rotation-track, and a 6-DoF-track, which evaluate how well different embodied agents predict the positions and rotations of target objects. We also propose a VLM-based approach for Open6DOR, named Open6DOR-GPT, which equips GPT-4V with 3D awareness and simulation assistance and exploits its strengths in generalizability and instruction following. We compare existing embodied agents with Open6DOR-GPT on the proposed Open6DOR benchmark and find that Open6DOR-GPT achieves state-of-the-art performance. We further demonstrate the performance of Open6DOR-GPT in diverse real-world experiments. Our benchmark and method will be released upon paper acceptance.
Supplementary Material: zip
Submission Number: 23