Do 3D Large Language Models Really Understand 3D Spatial Relationships?

ICLR 2026 Conference Submission 13399 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: 3D-LLM, 3D spatial reasoning
TL;DR: Existing 3D-QA benchmarks overestimate progress due to textual shortcuts, and our Real-3DQA benchmark plus 3D-reweighted fine-tuning enable more faithful evaluation and stronger 3D reasoning.
Abstract: Recent 3D Large Language Models (3D-LLMs) claim to understand 3D worlds, especially spatial relationships among objects. Yet, we find that a language model fine-tuned only on text question-answer pairs can match or even surpass these methods on the SQA3D benchmark without using any 3D input. This indicates that the SQA3D benchmark may not be able to detect whether a model exploits textual shortcuts rather than engaging in 3D-aware reasoning. To address this issue, we introduce Real-3DQA, a more rigorous evaluation benchmark that filters out easy-to-guess questions and introduces a structured taxonomy to assess various aspects of 3D reasoning. Experiments on Real-3DQA confirm that existing 3D-LLMs struggle with spatial relationships once simple cues are removed. We further propose a 3D-reweighted training objective that leverages negative samples via explicit 3D-relation alignment, substantially enhancing 3D-LLMs' performance on spatial reasoning tasks. Our findings underscore the need for robust benchmarks and tailored training strategies to advance genuine 3D vision-language understanding.
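The abstract does not spell out the 3D-reweighted objective, so the following is only a minimal sketch of one plausible form: a standard QA cross-entropy term plus a margin term that contrasts the gold answer's likelihood under the true 3D relations against a negative sample with perturbed relations. All names, the margin formulation, and the weighting scheme here are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def reweighted_3d_loss(pos_logits: torch.Tensor,
                       neg_logits: torch.Tensor,
                       targets: torch.Tensor,
                       neg_weight: float = 0.5,
                       margin: float = 1.0) -> torch.Tensor:
    """Hypothetical 3D-reweighted objective (sketch, not the paper's code).

    pos_logits: (B, V) answer logits conditioned on the true 3D scene.
    neg_logits: (B, V) answer logits conditioned on a scene whose 3D
                relations were perturbed (the negative sample).
    targets:    (B,) gold answer token ids.
    """
    # Standard QA loss on the positive (correct-scene) pairing.
    ce = F.cross_entropy(pos_logits, targets)

    # Log-likelihood of the gold answer under positive vs. negative scenes.
    pos_lp = F.log_softmax(pos_logits, dim=-1).gather(1, targets[:, None]).squeeze(1)
    neg_lp = F.log_softmax(neg_logits, dim=-1).gather(1, targets[:, None]).squeeze(1)

    # Margin term: the gold answer should be likelier under the true
    # 3D relations than under the perturbed ones, discouraging answers
    # recoverable from text alone.
    contrast = F.relu(margin - (pos_lp - neg_lp)).mean()

    return ce + neg_weight * contrast
```

Under this assumed form, a model relying on textual shortcuts scores the gold answer equally well under both scenes, so the margin term penalizes it, while a genuinely 3D-aware model incurs no extra loss.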
Primary Area: datasets and benchmarks
Submission Number: 13399