RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

Published: 28 Feb 2025 · Last Modified: 17 Apr 2025
Venue: WRL@ICLR 2025 (Poster)
License: CC BY 4.0
Track: full paper
Keywords: spatial, spatial understanding, spatial reasoning, mllm, vlm, lmm, manipulation, 2d, 3d, robotics
TL;DR: We introduce RoboSpatial, a large-scale 2D/3D spatial understanding dataset for robotics, used to train and evaluate 2D and 3D vision-language models on spatial understanding in indoor and tabletop environments.
Abstract: Spatial understanding is essential for robots to perceive, reason about, and interact with their environments. However, current vision-language models often rely on general-purpose image datasets that lack robust spatial scene understanding and reference frame comprehension (ego-, world-, or object-centric). To address this gap, we introduce RoboSpatial, a large-scale dataset of real indoor and tabletop environments captured as egocentric images and 3D scans. RoboSpatial provides 1M images, 5k 3D scans, and 3M annotated spatial relationships, enabling both 2D and 3D spatial reasoning. Models trained on RoboSpatial outperform baselines on tasks including spatial affordance prediction, spatial relationship prediction, and robot manipulation.
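To make the abstract's description of the annotations concrete, the sketch below shows what a single spatial-relationship record with an explicit reference frame might look like. This is a minimal, hypothetical schema assumed for illustration; the field names, values, and file paths are not taken from the actual RoboSpatial release.

```python
# Hypothetical sketch of a RoboSpatial-style spatial-relationship record.
# All field names and values are illustrative assumptions; the dataset's
# actual schema is not specified on this page.
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class SpatialRelationRecord:
    image_path: str                                      # egocentric RGB frame
    scan_id: Optional[str]                               # optional link to a 3D scan
    question: str                                        # natural-language spatial query
    reference_frame: Literal["ego", "world", "object"]   # frame the query is posed in
    answer: str                                          # e.g. "yes" / "no"


# Example: an object-centric relationship question about a tabletop scene.
record = SpatialRelationRecord(
    image_path="images/kitchen_000123.jpg",
    scan_id="scan_0042",
    question="Is the mug in front of the laptop?",
    reference_frame="object",
    answer="yes",
)

if __name__ == "__main__":
    print(record)
```

Making the reference frame an explicit field, as sketched here, is one way a dataset could force models to distinguish ego-, world-, and object-centric readings of the same relationship, which is the gap the abstract highlights.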
Supplementary Material: pdf
Presenter: ~Chan_Hee_Song1
Format: No, the presenting author is unable to, or unlikely to be able to, attend in person.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding availability would significantly influence their ability to attend the workshop in person.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 59