On Inherent 3D Reasoning of VLMs in Indoor Scene Layout Design

Amlan Kar; David Acuna; Sanja Fidler

27 Sept 2024 (modified: 05 Feb 2025), Submitted to ICLR 2025, CC BY 4.0

Keywords: VLMs, Evaluation, 3D

TL;DR: We evaluate VLMs’ 3D understanding on key tasks necessary to build an agent-based indoor scene layout design pipeline.

Abstract: Large vision-language models (VLMs) such as GPT-4o and Llama-3.2 have shown remarkable capabilities in visual understanding and reasoning, prompting us to test their off-the-shelf ability to reason and act as a 3D design assistant. This study investigates VLMs’ visual reasoning capabilities using 3D indoor scene layout synthesis, i.e., the placement of furniture in a room, as a test-bed. We study three key primitive abilities in this context: (1) communication of spatial locations, (2) reasoning about free space and object collision, and (3) reasoning about object alignment, orientation, and functionality, each crucial to creating a VLM agent-based scene layout synthesis pipeline. We evaluate five state-of-the-art VLMs, both proprietary and open, on a new dataset of 3400 questions that assess VLMs’ current visual reasoning abilities in our context. Our findings reveal several remarkable insights: (1) VLMs consistently prefer normalized coordinates for spatial communication over absolute coordinates or pointing with image markers. (2) Contrary to expectations, VLMs perform best with a simplified sketch-based scene representation or, most strikingly, with no visual input at all, compared to detailed renderings. (3) Free space reasoning remains challenging, with performance only slightly above random guessing, though frontier models show significant improvement with collision checking tools. Surprisingly, free space reasoning can fail even when collisions are clearly visible in the image. (4) Reasoning about object alignment, size, orientation, and functionality together compounds errors, leading to near-chance performance on our dataset. These findings offer insights into the current potential and limitations of using VLMs off-the-shelf to develop advanced visual assistants capable of understanding and manipulating 3D environments.

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 10497
