Phrase Grounding-based Style Transfer for Single-Domain Generalized Object Detection

Hao Li; Wei Wang; Cong Wang; Mengzhu Wang; Zhigang Luo; Xinwang Liu; Kenli Li

22 Sept 2023 (modified: 29 Jan 2024), ICLR 2024 Conference Withdrawn Submission

Keywords: single domain generalization, object detection, transfer learning, style transfer

TL;DR: We propose a novel single-domain generalization approach for object detection, which significantly outperforms existing state-of-the-art methods.

Abstract: This paper addresses single-domain generalized object detection, a challenging setting in which a detector trained on only one source domain must perform well on multiple unseen target domains. The grounded language-image pre-training (GLIP) model, which has recently gained widespread attention, reformulates object detection as a phrase grounding task by aligning each region or box with phrases in a textual prompt. Inspired by this, we propose a phrase grounding-based style transfer (PGST) approach for single-domain generalized object detection. Specifically, we introduce a textual prompt that contains a set of phrases for each target domain, such as "a car driving in the foggy scene". We then use the corresponding target textual prompt to train the PGST module from the source domain to the target domain, with the localization loss and region-phrase alignment loss from GLIP as training losses. As a result, the visual features of the source domain move close to their imagined counterparts in the target domain while preserving their semantic content. With PGST frozen, we fine-tune the image and text encoders of GLIP on the style-transferred visual features of the source domain to improve the model's generalization to the corresponding unseen target domains. Our approach significantly outperforms existing state-of-the-art methods, improving mean average precision (mAP) by 8.5% on average across five diverse-weather driving benchmarks. In addition, on some datasets our performance matches or even surpasses that of domain-adaptive object detection methods, even though those methods use target-domain images during training.
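
The two-stage procedure described in the abstract can be outlined with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the class names, the domain descriptions other than the foggy example, the ". " phrase separator, the helper names `build_prompt` and `trainable_modules`, and the choice to keep the GLIP encoders frozen during the first stage are all hypothetical.

```python
from typing import Dict, List

# Hypothetical class names; the abstract does not list the detection classes.
CLASSES: List[str] = ["car", "bus", "truck"]

# Hypothetical domain descriptions; only the foggy phrase appears in the abstract.
DOMAINS: Dict[str, str] = {
    "foggy": "driving in the foggy scene",
    "rainy": "driving in the rainy scene",
}


def build_prompt(classes: List[str], domain_desc: str) -> str:
    """Build one textual prompt for a target domain: one phrase per class.

    The ". " separator between phrases is an assumption of this sketch.
    """
    return ". ".join(f"a {name} {domain_desc}" for name in classes)


def trainable_modules(stage: int) -> Dict[str, bool]:
    """Return which modules receive gradient updates in each stage.

    Stage 1: train the PGST module with the target prompt (this sketch
             assumes the GLIP encoders stay frozen here).
    Stage 2: freeze PGST and fine-tune the GLIP image and text encoders
             on the style-transferred source features.
    """
    if stage == 1:
        return {"pgst": True, "image_encoder": False, "text_encoder": False}
    if stage == 2:
        return {"pgst": False, "image_encoder": True, "text_encoder": True}
    raise ValueError(f"unknown stage: {stage}")


if __name__ == "__main__":
    for domain, desc in DOMAINS.items():
        print(domain, "->", build_prompt(CLASSES, desc))
    for stage in (1, 2):
        print("stage", stage, trainable_modules(stage))
```

In both stages the losses would be the localization loss and region-phrase alignment loss from GLIP named in the abstract; they are omitted here because the abstract does not specify their form.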

Supplementary Material: zip

Primary Area: transfer learning, meta learning, and lifelong learning

Submission Number: 4948
