Does Your 3D Encoder Really Work? A simple yet effective pathway to real 3D scene understanding

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Spatial intelligence, 3D scene understanding
Abstract: Remarkable progress in 2D Vision-Language Models (VLMs) has spurred interest in extending them to 3D settings. Based on their encoder design, this paper categorizes recent 3D VLMs into 3D scene-centric, 3D object-centric, and 2D image-based approaches. Despite their architectural similarity to 2D counterparts, 3D scene-centric VLMs have exhibited lower performance than the latest 3D object-centric and 2D image-based approaches. To understand this gap, we conduct an in-depth analysis and find that these models show unstable data-scaling behavior and limited reliance on the 3D scene encoder, instead overfitting to linguistic cues and frequent answers. Although data balancing via under-sampling offers partial improvement, it fails to address the fundamental problem, as the model continues to largely ignore the 3D scene input. To address these limitations and encourage genuine 3D scene understanding, we introduce a simple yet effective training strategy: __rearranging the input sequence__. By positioning the 3D scene between the question and the answer, we prevent the model from learning shortcuts from linguistic cues alone and compel it to ground its comprehension in the visual context. Our experiments show that this method not only improves the model's genuine understanding, but also restores the effectiveness of standard pre-training and supervised fine-tuning stages. Crucially, our approach ensures that the 3D encoder plays an essential role, laying a more robust foundation for future 3D VLM research.
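To make the proposed rearrangement concrete, here is a minimal sketch of how the input sequence might be assembled; the token names and helper functions are illustrative assumptions, not the authors' actual implementation:

```python
# Illustrative sketch (not the authors' code): placing the 3D scene
# tokens between the question and the answer, so a causal LM cannot
# predict the answer from the question alone without attending to the
# scene representation in between.

def build_baseline(scene_tokens, question_tokens, answer_tokens):
    """A common baseline ordering: scene first, then question, then answer."""
    return scene_tokens + question_tokens + answer_tokens

def build_rearranged(scene_tokens, question_tokens, answer_tokens):
    """The proposed ordering: question -> scene -> answer."""
    return question_tokens + scene_tokens + answer_tokens

# Placeholder tokens for demonstration.
scene = ["<scene_0>", "<scene_1>", "<scene_2>"]       # 3D scene embeddings
question = ["<q>", "where", "is", "the", "chair", "?"]
answer = ["<a>", "next", "to", "the", "table"]

baseline = build_baseline(scene, question, answer)
rearranged = build_rearranged(scene, question, answer)

# In the rearranged sequence, the answer tokens directly follow the scene
# tokens, blocking the linguistic question-to-answer shortcut.
```

The training loss would still be computed only on the answer tokens in either ordering; the change is purely in where the scene tokens appear relative to the question.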
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7933