R2D3: Imparting Spatial Reasoning by Reconstructing 3D Scenes from 2D Images

Arijit Ray, Dina Bashkirova, Reuben Tan, Kuo-Hao Zeng, Bryan A. Plummer, Ranjay Krishna, Kate Saenko

Published: 06 Jun 2024, Last Modified: 05 Sept 2024, OpenReview Archive Direct Upload, Readers: Everyone, License: CC0 1.0

Abstract: Cognitive scientists regard 3D spatial reasoning as foundational to all intellectual processes. Multimodal large language models (MLMs), which have been widely adopted due to their impressive commonsense reasoning on 2D images, have been shown to lack 3D spatial reasoning. There is limited evaluation of what imparts precise 3D spatial capabilities to these models. Existing benchmarks for probing spatial understanding in MLMs mostly focus on coarse-level spatial awareness (e.g., to the left vs. to the right of) or on predicting a bounding box for a given object query. Instead, we wish to conduct a more holistic evaluation of a model's semantic and spatial understanding of the entire scene. Hence, we propose a benchmark, R2D3, in which an MLM is tasked with representing a 2D image as a set of semantic assets with precise 3D locations and poses that can accurately reconstruct the 3D scene in a graphics engine. This "analysis by synthesis" task requires the model to have a comprehensive understanding of the elements that make up the scene and their precise relative 3D locations. Our benchmark includes 12K indoor scenes in the AI2THOR environment and is compatible with several downstream applications such as embodied AI, spatial reasoning, and navigation tasks. Using our benchmark, we explore tuning techniques for MLMs that encourage precise spatial reasoning. Surprisingly, we find that conventional fine-tuning on the training set of our benchmark, while sufficient for understanding semantics, is not sufficient for learning the precise 3D locations and poses of the objects in a scene. However, including depth, or conveying the precise camera-scene orientation by marking a point in the image and including its 3D coordinate during training, allows the model to improve its 3D spatial estimation at test time. We hope that the R2D3 benchmark will help drive progress in exploring design choices that improve the spatial understanding of MLMs.