Abstract: As an emerging task bridging vision and language, Language-grounded Multimodal 3D Scene Understanding (3D-LMSU) has attracted significant interest across various domains, such as robot navigation and human–computer interaction. It aims to generate detailed and precise responses to textual queries related to 3D scenes. Despite the popularity and effectiveness of existing methods, the absence of a comprehensive survey hampers further development. To address this gap, we present the first systematic survey of recent progress in this field. We start with a concise overview of the background, including the problem definition and available benchmark datasets. Subsequently, we introduce a novel taxonomy that comprehensively classifies existing methods by technology and task. We then present the evaluation metrics for each task, along with the performance results of various methods. Furthermore, we offer insightful discussions from three critical perspectives: data, framework, and training. Finally, we conclude the paper by highlighting several promising avenues for future research. This study synthesizes the field and guides researchers toward further exploration.