Abstract: Object detectors often suffer a performance drop due to the large domain gap between the training data (source domain) and real-world data (target domain). Diffusion-based generative models have shown a remarkable ability to generate high-quality and diverse images, suggesting their potential for extracting valuable features from various domains. To effectively leverage the cross-domain feature representation of diffusion models, in this paper we train a detector with a frozen-weight diffusion model on the source domain, then employ it as a teacher model to generate pseudo labels on the unlabeled target domain, which guide the supervised learning of the student model on the target domain. We refer to this approach as Diffusion Domain Teacher (DDT). With this straightforward yet potent framework, we significantly improve cross-domain object detection performance without compromising inference speed. Our method achieves an average mAP improvement of 21.2% over the baseline on 6 datasets from three common cross-domain detection benchmarks (Cross-Camera, Syn2Real, Real2Artistic), surpassing the current state-of-the-art (SOTA) methods by an average of 5.7% mAP. Furthermore, extensive experiments demonstrate that our method consistently brings improvements even to more powerful and complex models, highlighting the broadly applicable and effective domain adaptation capability of our DDT.
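To make the described pipeline concrete, below is a minimal, hypothetical sketch of the teacher-student training loop the abstract outlines. All names here (teacher, student, score_thresh) and the torchvision-style loss-dict detector interface are assumptions for illustration, not the paper's actual implementation; the teacher is presumed already trained on the labeled source domain with its diffusion-model weights kept frozen.

```python
import torch

def train_ddt_student(teacher, student, target_loader, optimizer,
                      score_thresh=0.8):
    """Sketch of one DDT epoch on the unlabeled target domain.

    teacher: detector with a frozen-weight diffusion backbone,
             pre-trained on the labeled source domain.
    student: detector being adapted to the target domain.
    """
    teacher.eval()
    student.train()
    for images, _ in target_loader:  # target labels are unavailable
        # 1) The teacher generates pseudo labels on target-domain images.
        with torch.no_grad():
            preds = teacher(images)
        # Keep only confident detections as pseudo ground truth
        # (the threshold value is a hypothetical choice).
        pseudo_labels = [
            {"boxes": p["boxes"][p["scores"] > score_thresh],
             "labels": p["labels"][p["scores"] > score_thresh]}
            for p in preds
        ]
        # 2) Pseudo labels supervise the student on the target domain;
        #    a torchvision-style detector returns a dict of losses.
        loss_dict = student(images, pseudo_labels)
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Since only the student is deployed at test time, this setup adds no cost at inference, which is consistent with the abstract's claim that inference speed is not compromised.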
Primary Subject Area: [Generation] Multimedia Foundation Models
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This work investigates the application of text-to-image diffusion generative models to cross-domain feature representation and recognition tasks. It demonstrates the powerful visual feature representation capability of diffusion models, which are built on multimedia data (text and images) and architectures, as well as their great potential in cross-domain recognition. This research contributes to the advancement of generative models in image understanding and recognition.
Supplementary Material: zip
Submission Number: 1983