Keywords: 3D Scene Understanding, Large Language Model
Abstract: Understanding and interacting with complex 3D scenes is a fundamental challenge for embodied agents. Large language models (LLMs) have demonstrated strong capabilities in text and 2D image understanding; however, existing LLMs with 3D encoders suffer from a shortage of paired 3D data for scalable training. In this work, we propose Single-Image and Text Encoders (SITE), a general framework that uses a 1D text encoder and a 2D image encoder for structured scene parsing and 3D scene understanding. Specifically, we i) design a Scene2Text module to extract instance-level relations, ii) transform multi-view observations into BEV images to interpret spatial relations, and iii) fuse these 1D and 2D encoders into LLM fine-tuning for consistent 3D understanding. In addition, we introduce InPlan3D, a long-sequence planning benchmark for further evaluating embodied reasoning ability. Extensive experiments demonstrate the effectiveness and efficiency of SITE on multiple 3D scene understanding datasets and on InPlan3D, with lower token cost. Code and dataset will be publicly released.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17073