Focusing on feature-level domain alignment with text semantic for weakly-supervised domain adaptive object detection
Abstract: Domain adaptive object detection transfers knowledge learned in the source domain to the target domain, thereby improving object detection performance on the target domain. Existing work focuses on introducing image-level annotations for the target domain and enhancing visual semantics through Weakly-Supervised Domain Adaptive (WSDA) learning paradigms. However, we observe that relying solely on visual annotations fails to guide the network effectively in learning the intrinsic knowledge of indistinct boundary features, because the similarity between source and target domain features can cause them to become entangled. Fortunately, recent research has shown that Contrastive Language-Image Pre-training (CLIP) effectively enhances visual feature representation by introducing text semantics. Inspired by this, we propose DACLIP, which uses text semantics to encourage feature-level domain alignment and resolve feature entanglement. First, we design the Domain-wise Text Assistant (DTA) module, which learns the inherent attributes of the global feature by providing text prompts specific to each domain distribution. Second, we design the Class-wise Text Assistant (CTA) module, which uses text semantics to guide visual learning and eliminate interference from similar instance attributes at a finer granularity. Unlike previous methods that directly replace detection heads with CLIP, DACLIP focuses on feature-level domain alignment, matching the inherent attributes of different domains in the shared feature space from coarse to fine. The proposed DACLIP achieves state-of-the-art performance on multiple datasets, particularly in the challenging PASCAL VOC → Watercolor scenario, where it achieves 62.2% mAP.
External IDs: dblp:journals/ijon/ChenCXHLDT25