Visual Foundation Models Boost Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation

Published: 2025 | Last Modified: 21 Jan 2026 | IEEE Trans. Intell. Transp. Syst. 2025 | CC BY-SA 4.0
Abstract: Unsupervised domain adaptation (UDA) is vital for reducing the labeling workload for 3D point cloud data and for mitigating the absence of labels in an unseen domain. Various methods have recently emerged that use images alongside point clouds to improve cross-domain 3D segmentation. However, the pseudo labels, which are generated by models trained on the source domain and provide additional supervisory signals for the target domain, are inherently noisy and therefore limit the accuracy of the resulting networks. With the advent of 2D Visual Foundation Models (VFMs) and their rich knowledge priors, we propose VFMSeg, a novel pipeline that leverages these models to further enhance the cross-modal UDA framework. In this work, we study how to harness the knowledge priors learned by VFMs to produce more accurate labels for unlabeled target domains and improve overall performance. We first use a multi-modal VFM, pre-trained on large-scale image-text pairs, to generate pseudo labels (VFM-PL) for images and point clouds from the target domain. We then adopt another VFM to produce fine-grained 2D masks that guide the generation of augmented images and point clouds, mixing data from the source and target domains along view frustums (FrustumMixing). Finally, we merge class-wise predictions across modalities to produce more accurate annotations for unlabeled target domains. Our method is evaluated on several autonomous driving datasets, and the results demonstrate a significant improvement on the cross-domain 3D segmentation task. Our code is available at https://github.com/EtronTech/VFMSeg.
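
The abstract describes three ingredients: VFM-generated pseudo labels, mask-guided mixing of source and target scans (FrustumMixing), and a class-wise fusion of 2D and 3D predictions. The sketch below is only a minimal illustration of the last two ideas; all function names, tensor shapes, the equal 0.5/0.5 averaging weights, and the confidence threshold are assumptions made for this example and are not taken from the paper or the linked repository.

```python
import torch


def fuse_cross_modal_pseudo_labels(logits_2d, logits_3d, conf_threshold=0.9):
    """Fuse per-point class predictions from the 2D and 3D branches.

    logits_2d, logits_3d: (N, C) tensors of per-point class logits. The 2D
    logits are assumed to have already been lifted onto the points via the
    camera projection. Points whose fused confidence falls below
    `conf_threshold` are marked as ignore (-1), a common way to suppress
    noisy pseudo labels.
    """
    probs_2d = torch.softmax(logits_2d, dim=1)
    probs_3d = torch.softmax(logits_3d, dim=1)
    fused = 0.5 * (probs_2d + probs_3d)      # simple class-wise average
    conf, labels = fused.max(dim=1)
    labels[conf < conf_threshold] = -1       # ignore low-confidence points
    return labels


def frustum_mix(pts_src, lbl_src, keep_src, pts_tgt, lbl_tgt, drop_tgt):
    """Compose an augmented scan in the spirit of FrustumMixing.

    keep_src / drop_tgt are boolean per-point masks marking the points that
    fall inside the frustum of a selected 2D mask region. Source points
    inside the frustum are pasted into the target scan after the target
    points inside the same frustum are removed.
    """
    mixed_pts = torch.cat([pts_src[keep_src], pts_tgt[~drop_tgt]], dim=0)
    mixed_lbl = torch.cat([lbl_src[keep_src], lbl_tgt[~drop_tgt]], dim=0)
    return mixed_pts, mixed_lbl


# Toy usage with random tensors (N points, C classes).
N, C = 1000, 10
pseudo_labels = fuse_cross_modal_pseudo_labels(torch.randn(N, C), torch.randn(N, C))
```

In practice, the 2D predictions must first be projected onto the point cloud using the camera calibration, and the confidence threshold trades pseudo-label coverage against noise; the actual implementation details are in the authors' repository.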