CLIP2UDA: Making Frozen CLIP Reward Unsupervised Domain Adaptation in 3D Semantic Segmentation

Published: 20 Jul 2024 · Last Modified: 01 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Multi-modal Unsupervised Domain Adaptation (MM-UDA) for large-scale 3D semantic segmentation involves adapting 2D and 3D models to an unlabeled target domain, which significantly reduces labor-intensive annotation. Existing MM-UDA methods often attempt to mitigate the domain discrepancy by aligning features between the source and target data. However, such alignment falls short for image perception, because images are more susceptible to environmental changes than point clouds. To address this limitation, in this work we explore the potential of an off-the-shelf Contrastive Language-Image Pre-training (CLIP) model with rich yet heterogeneous knowledge. To make CLIP task-specific, we propose a top-performing method, dubbed \textbf{CLIP2UDA}, which makes frozen CLIP reward unsupervised domain adaptation in 3D semantic segmentation. Specifically, CLIP2UDA alternates between two steps during adaptation: (a) Learning task-specific prompts. 2D feature responses from the visual encoder are employed to drive the learning of adaptive text prompts for each domain; and (b) Learning multi-modal domain-invariant representations. These representations interact hierarchically in a shared decoder to obtain unified 2D visual predictions. This enhancement allows for effective alignment between the modality-specific 3D feature space and the unified feature space via cross-modal mutual learning. Extensive experimental results demonstrate that our method outperforms state-of-the-art competitors in several widely-recognized adaptation scenarios. Code is available at: \textcolor{blue}{\url{https://github.com/Barcaaaa/CLIP2UDA}}.
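To make the alternation between steps (a) and (b) concrete, the following is a minimal, hypothetical PyTorch-style sketch of one adaptation iteration as implied by the abstract. All module names (clip_visual, clip_text, prompt_learner, shared_decoder, net_3d) and the specific loss choices (cross-entropy on labeled source data, symmetric KL divergence for cross-modal mutual learning on target data) are illustrative assumptions, not the released CLIP2UDA implementation; see the linked repository for the authors' code.

    import torch
    import torch.nn.functional as F

    def adapt_one_step(batch_src, batch_tgt, clip_visual, clip_text,
                       prompt_learner, shared_decoder, net_3d, optimizer,
                       lam=0.1):
        """One alternating adaptation step; all modules are assumed placeholders."""
        # ---- Step (a): task-specific prompt learning ----
        with torch.no_grad():                          # CLIP image encoder stays frozen
            feat_src = clip_visual(batch_src["image"])
            feat_tgt = clip_visual(batch_tgt["image"])
        text_src = clip_text(prompt_learner(feat_src, domain="source"))
        text_tgt = clip_text(prompt_learner(feat_tgt, domain="target"))

        # ---- Step (b): multi-modal domain-invariant representations ----
        logits_2d_src = shared_decoder(feat_src, text_src)   # unified 2D predictions
        logits_2d_tgt = shared_decoder(feat_tgt, text_tgt)
        logits_3d_src = net_3d(batch_src["points"])           # modality-specific 3D branch
        logits_3d_tgt = net_3d(batch_tgt["points"])

        # Supervised segmentation loss on labeled source data
        loss = F.cross_entropy(logits_2d_src, batch_src["labels"]) \
             + F.cross_entropy(logits_3d_src, batch_src["labels"])

        # Cross-modal mutual learning on unlabeled target data:
        # symmetric KL between the 2D and 3D class distributions (an assumed form)
        log_p2d = logits_2d_tgt.log_softmax(dim=-1)
        log_p3d = logits_3d_tgt.log_softmax(dim=-1)
        loss = loss + lam * (
            F.kl_div(log_p2d, log_p3d.exp().detach(), reduction="batchmean")
            + F.kl_div(log_p3d, log_p2d.exp().detach(), reduction="batchmean"))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

In this sketch only the prompt learner, shared decoder, and 3D network receive gradients, which reflects the "frozen CLIP" setting described above.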
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: In this work, we investigate the challenges that multi-modal perception models face in remaining robust to domain shifts in 3D scene understanding. Our findings underscore the need for more effective solutions in this area for autonomous driving.
Supplementary Material: zip
Submission Number: 1119