Abstract: As an emerging task bridging vision and language, Language-grounded Multimodal 3D Scene Understanding (3D-LMSU) has attracted significant interest across various domains, such as robot navigation and human–computer interaction. It aims to generate detailed and precise responses to textual queries related to 3D scenes. Despite the popularity and effectiveness of existing methods, the absence of a comprehensive survey hampers further development. To address this gap, we present the first systematic survey of recent progress in this field. We start with a concise overview of the background, including the problem definition and available benchmark datasets. Subsequently, we introduce a novel taxonomy that comprehensively classifies existing methods by technology and task. We then present the evaluation metrics for each task, along with the performance results of various methods. Furthermore, we offer insightful discussions from three critical perspectives: data, framework, and training. Finally, we conclude the paper by highlighting several promising avenues for future research. This study synthesizes the field and guides researchers toward further exploration.