Keywords: unsupervised domain adaptation, semantic segmentation, cross-modal, large language models, multimodal
Abstract: Pixel-level manual annotations are expensive and time-consuming to obtain for semantic segmentation. Unsupervised domain adaptation (UDA), which outperforms direct zero-shot inference, adapts a model from a label-rich source domain to a target domain where labels are scarce or unavailable. Recent progress in foundation models has demonstrated the potential of large vision-language models (VLMs) for zero-shot segmentation and domain-adaptive classification. However, the efficacy of VLMs in bridging domain gaps for semantic segmentation remains under-explored. To improve segmentation performance in UDA, we introduce a novel language-guided adaptation method (LangDA), which aligns image features with the domain-invariant text embeddings of VLMs during training. We generate the text embeddings by using a captioning VLM to produce image-specific textual descriptions, which are then passed to a frozen CLIP-based text encoder. To the best of our knowledge, this is the first work to use text to align visual domains in unsupervised domain adaptation for semantic segmentation (DASS). Our language-driven, plug-and-play UDA approach achieves a 62.0\% mean Jaccard index on the standard Synthia $\to$ Cityscapes benchmark, outperforming the current state-of-the-art by 0.9\% with negligible parameter overhead.
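The abstract describes aligning segmentation image features with text embeddings obtained by captioning each image and encoding the caption with a frozen CLIP text encoder. The sketch below is a minimal, hypothetical illustration of such an alignment loss, not the authors' implementation: the choice of CLIP checkpoint, the assumption that image features are already projected to CLIP's text-embedding dimension, and the cosine-distance formulation are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code): align pooled segmentation features
# with CLIP text embeddings of image-specific captions from a captioning VLM.
import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# Frozen CLIP text encoder; the specific checkpoint is an illustrative choice.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch16")
text_encoder.requires_grad_(False)

def caption_alignment_loss(image_features: torch.Tensor, captions: list[str]) -> torch.Tensor:
    """Pull image features toward CLIP embeddings of their captions.

    image_features: (B, D) pooled features from the segmentation encoder,
                    assumed already projected to CLIP's text-embedding dim D.
    captions:       B strings produced by a captioning VLM for the same images.
    """
    tokens = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        text_emb = text_encoder(**tokens).text_embeds   # (B, D), frozen
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    # Cosine-distance alignment loss, averaged over the batch.
    return (1.0 - (img * txt).sum(dim=-1)).mean()
```

In a UDA training loop, a term like this could be added to the usual segmentation and self-training losses for both source and target images, so that features from either domain are pulled toward the same caption-derived text space.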
Submission Number: 150