Abstract: This paper addresses the problem of cross-class domain adaptation (CCDA) in semantic segmentation, where the target domain contains both shared and novel classes that are either unlabeled or unseen in the source domain. This problem is challenging, as the absence of labels for novel classes hampers the accurate segmentation of both shared and novel classes. Since Visual Language Models (VLMs) can generate zero-shot predictions without requiring task-specific training examples, we propose a label alignment method that leverages VLMs to relabel pseudo labels for novel classes. Because VLMs typically provide only image-level predictions, we design a two-stage method to enable fine-grained semantic segmentation and introduce a threshold based on pseudo-label uncertainty to exclude noisy VLM predictions. To further strengthen the supervision of novel classes, we devise memory banks with an adaptive update scheme that retain accurate VLM predictions, which are then resampled to increase the sampling probability of novel classes. Through comprehensive experiments, we demonstrate the effectiveness and versatility of our proposed method across various CCDA scenarios.
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: ACM MM serves as a platform for researchers from various domains, including computer vision, natural language processing, and multimedia analysis. Our work at the intersection of computer vision, language understanding, and domain adaptation aligns well with the conference's scope. Furthermore, our approach presents novel advancements in leveraging visual language models for domain adaptation, which could contribute significantly to the multimedia community.
Supplementary Material: zip
Submission Number: 2665