Keywords: Situated Understanding in 3D Scen, 3D VL, LLM
Abstract: Integrating the 3D world into large language models (3D-based LLMs) has been a promising research direction for 3D scene understanding. However, current 3D-based LLMs fall short in situated understanding due to two key limitations: 1) existing 3D datasets are constructed from a global perspective of the 3D scenes and lack situated context.
2) the architectures of the current 3D-based LLMs lack an explicit mechanism for aligning situated spatial information between 3D representations and natural language, limiting their performance in tasks requiring precise spatial reasoning.
In this work, we address these issues by introducing a scalable situated 3D dataset, named Spartun3D, that incorporates various situated spatial information.
In addition, we propose a situated spatial alignment module to enhance the learning between 3D visual representations and their corresponding textual descriptions. Our experimental results demonstrate that both our dataset and alignment module enhance situated spatial understanding ability.
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3857
Loading