Open-World 3D Scene Understanding with Cross-Modal Dual Consistency Learning

Published: 01 Jan 2025, Last Modified: 18 Oct 2025, ICMR 2025, CC BY-SA 4.0
Abstract: Large Vision-Language Pre-training models have achieved remarkable advances in zero-shot and few-shot visual tasks in the 2D domain. However, extending their potential to the 3D counterparts is challenging due to the scarcity of 3D-text pairs, which leaves open-world 3D scene understanding largely unexplored. In this paper, we pre-train a 3D visual-language model based on cross-modal dual consistency learning to learn semantically rich 3D point cloud representations. Specifically, we first introduce a Visual Feature Distribution Consistency strategy to bridge the gap between point clouds and images. Then, a Visual-Semantic Enhancement Feature Distribution Consistency approach is developed to narrow the distance between the enhanced visual and language information. Finally, under the supervision of rich knowledge from 2D Vision-Language Models (VLMs), the learned 3D features achieve strong generalization capability, facilitating open-world 3D scene understanding. Quantitative and qualitative evaluations on the ScanNet and Matterport3D benchmarks demonstrate the effectiveness of our pre-training method for open-world 3D semantic segmentation.
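To make the dual-consistency idea concrete, the following is a minimal sketch of how two alignment objectives of this kind could be combined: one term pulling 3D point features toward paired 2D image features, and a second term pulling an enhanced visual feature toward text embeddings from a 2D VLM. The loss form (cosine alignment), the fusion module `fuse`, and the weights `alpha`/`beta` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def consistency_loss(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    # Cosine-distance alignment between two feature sets of shape (N, D).
    # Both inputs are L2-normalized so the loss measures angular discrepancy only.
    feat_a = F.normalize(feat_a, dim=-1)
    feat_b = F.normalize(feat_b, dim=-1)
    return (1.0 - (feat_a * feat_b).sum(dim=-1)).mean()


def dual_consistency_loss(point_feats, image_feats, text_feats, fuse,
                          alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    # Term 1 (assumed form of Visual Feature Distribution Consistency):
    # align 3D point features with their paired 2D image features.
    loss_vis = consistency_loss(point_feats, image_feats)

    # Term 2 (assumed form of Visual-Semantic Enhancement consistency):
    # fuse point and image features into an "enhanced" visual feature and
    # align it with CLIP-style text embeddings of the same samples.
    enhanced = fuse(torch.cat([point_feats, image_feats], dim=-1))
    loss_sem = consistency_loss(enhanced, text_feats)

    return alpha * loss_vis + beta * loss_sem


if __name__ == "__main__":
    # Toy usage with random features; `fuse` is a hypothetical fusion head.
    n, d = 64, 512
    fuse = torch.nn.Linear(2 * d, d)
    loss = dual_consistency_loss(torch.randn(n, d), torch.randn(n, d),
                                 torch.randn(n, d), fuse)
    print(loss.item())
```

In practice the image and text features would come from a frozen 2D VLM, so gradients would flow only into the 3D point encoder and the fusion head under this sketch.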