Taming Self-Training for Open-Vocabulary Object Detection

21 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Open-vocabulary object detection, pseudo labels, vision and language pretraining
TL;DR: Improve pseudo labels from pretrained vision and language models with self-training and a split-and-fusion head
Abstract: Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However, teacher-student self-training, a powerful and widely used paradigm for leveraging PLs, is rarely explored for OVD. This work identifies two challenges of using self-training in OVD: noisy PLs from VLMs and frequent changes in the distribution of PLs. To address these challenges, we propose SAS-Det, which tames self-training for OVD in two key aspects. First, we present a split-and-fusion (SAF) head that splits a standard detection head into an open branch and a closed branch. This design keeps noisy pseudo boxes out of the supervision signal. Moreover, the two branches learn complementary knowledge from different training data, significantly enhancing performance when fused. Second, we argue that, unlike in closed-set tasks, the distribution of PLs in OVD is determined solely by the teacher model. We therefore introduce a periodic update strategy that reduces the number of updates to the teacher, thereby reducing the frequency of changes in the PL distribution. Extensive experiments demonstrate that SAS-Det is both efficient and effective. Our pseudo labeling is three times faster than that of prior methods. SAS-Det outperforms prior state-of-the-art models of the same scale by a clear margin, achieving 37.4 AP50 and 29.1 APr on the novel categories of the COCO and LVIS benchmarks, respectively.
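
To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch written for illustration only; it is not the authors' released code. The names SAFHead and maybe_update_teacher, the geometric-mean fusion with weight alpha, and the update period are all assumptions; only the overall split/fuse structure and the periodic teacher update follow the abstract.

import torch
import torch.nn as nn

class SAFHead(nn.Module):
    """Split-and-fusion head sketch: a closed branch trained on ground-truth
    base-category boxes and an open branch trained on pseudo labels; their
    classification scores are fused at inference."""
    def __init__(self, feat_dim, embed_dim, alpha=0.5):
        super().__init__()
        self.closed_branch = nn.Linear(feat_dim, embed_dim)  # supervised by GT base boxes
        self.open_branch = nn.Linear(feat_dim, embed_dim)    # supervised by pseudo labels
        self.alpha = alpha  # hypothetical fusion weight

    def forward(self, roi_feats, text_embeds):
        # Score RoI features against frozen category text embeddings.
        s_closed = (self.closed_branch(roi_feats) @ text_embeds.T).sigmoid()
        s_open = (self.open_branch(roi_feats) @ text_embeds.T).sigmoid()
        # Geometric-mean fusion of the two complementary branches (an
        # assumption; the paper defines the exact fusion rule).
        return s_closed.pow(self.alpha) * s_open.pow(1.0 - self.alpha)

def maybe_update_teacher(teacher, student, step, period=1000):
    """Periodic update sketch: copy student weights into the teacher only
    every `period` steps (hypothetical value), so the pseudo-label
    distribution changes infrequently rather than at every iteration."""
    if step % period == 0:
        teacher.load_state_dict(student.state_dict())
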
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4235