Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs

Published: 18 Sept 2025, Last Modified: 29 Oct 2025
NeurIPS 2025 poster
License: CC BY 4.0
Keywords: multimodal learning, spatial reasoning, multimodal large language models
Abstract: Unlocking spatial reasoning in Multimodal Large Language Models (MLLMs) is crucial for enabling intelligent interaction with 3D environments. While prior efforts often rely on explicit 3D inputs or specialized model architectures, we ask: can MLLMs reason about 3D space using only structured 2D representations derived from perception? In this work, we introduce Struct2D, a perception-guided prompting framework that combines bird’s-eye-view (BEV) images with object marks and object-centric metadata, optionally incorporating egocentric keyframes when needed. Using Struct2D, we conduct an in-depth zero-shot analysis of closed-source MLLMs (e.g., GPT-o3) and find that they exhibit surprisingly strong spatial reasoning abilities when provided with projected 2D inputs, effectively handling tasks such as relative direction estimation and route planning. Motivated by these findings, we construct a large-scale instruction-tuning dataset, Struct2D-Set, using an automated pipeline that generates fine-grained QA pairs grounded in 3D indoor scenes. We then fine-tune an open-source MLLM (Qwen2.5VL) on Struct2D-Set, relying on noisy 3D perception rather than ground-truth annotations. Despite this, the tuned model achieves strong performance across multiple spatial reasoning benchmarks, including 3D question answering, captioning, and object grounding, spanning eight diverse reasoning categories. Our approach demonstrates that structured 2D inputs can effectively bridge perception and language reasoning in MLLMs, without requiring explicit 3D representations as input. We will release both our code and dataset to support future research.
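To make the input format concrete, the minimal sketch below (our illustration, not the authors' released pipeline) assembles a Struct2D-style prompt from a marked BEV image, object-centric metadata, optional egocentric keyframes, and a spatial question. It assumes a generic OpenAI-style multimodal message schema; the file paths, field names (e.g., `bev_center`), and helper functions are hypothetical.

```python
# Illustrative sketch only: not the authors' released code. It assumes an
# OpenAI-style multimodal message format and hypothetical metadata fields
# such as `bev_center`; the paper's actual prompt layout may differ.
import base64
import json
from pathlib import Path


def encode_image(path: str) -> str:
    """Base64-encode an image file so it can be inlined in a multimodal prompt."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")


def build_struct2d_prompt(bev_image_path, objects, question, keyframe_paths=()):
    """Assemble a Struct2D-style prompt: a marked BEV render, object-centric
    metadata as text, optional egocentric keyframes, and the spatial question."""
    # Object ids in the metadata correspond to the numeric marks drawn on the BEV image.
    metadata = json.dumps(objects, indent=2)
    content = [
        {"type": "text",
         "text": "Bird's-eye view of the scene; numeric marks identify objects."},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64," + encode_image(bev_image_path)}},
        {"type": "text",
         "text": "Object metadata (BEV coordinates, meters):\n" + metadata},
    ]
    # Egocentric keyframes are optional; include them only when the question
    # needs first-person views (e.g., appearance or viewpoint-dependent queries).
    for path in keyframe_paths:
        content.append({"type": "image_url",
                        "image_url": {"url": "data:image/png;base64," + encode_image(path)}})
    content.append({"type": "text", "text": "Question: " + question})
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    objects = [
        {"id": 1, "label": "sofa", "bev_center": [2.1, 0.8]},
        {"id": 2, "label": "television", "bev_center": [4.3, 0.5]},
        {"id": 3, "label": "door", "bev_center": [2.0, 3.5]},
    ]
    question = ("Standing at the sofa and facing the television, "
                "is the door (object 3) to your left or to your right?")
    # `scene_bev.png` is a placeholder path; the demo only runs if it exists.
    if Path("scene_bev.png").exists():
        prompt = build_struct2d_prompt("scene_bev.png", objects, question)
        print(json.dumps(prompt)[:300])
```

The design intent, as described in the abstract, is that the numeric marks on the BEV render and the coordinates in the textual metadata refer to the same objects, so the model can ground relative-direction and route-planning questions in projected 2D structure without receiving any explicit 3D representation.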
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 2069