Simplifying Self-Supervised Object Detection Pretraining

22 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: unsupervised; self-supervised; object detection
TL;DR: We propose SEER, a novel method for unsupervised object detection pretraining that combines better object proposals, pseudo-labels for pretraining, and self-training to achieve state-of-the-art results on both object-centric and scene-centric images.
Abstract: Object detectors are often trained by first training the backbone in a self-supervised manner and then fine-tuning the whole model on annotated data. An unsupervised detector pretraining stage can also be interleaved, further improving the final performance and facilitating convergence during the supervised fine-tuning stage. However, existing unsupervised pretraining methods typically rely on low-level information to create pseudo-proposals that the model is then trained to localize, and ignore high-level class membership. The absence of class semantics from the pretraining objective causes a task gap between pretraining and the downstream scenario, where detection is class-aware (e.g., given an image of a chair, the detector's task is to both localize it and assign the "chair" class to the corresponding bounding box). This gap results in suboptimal detector pretraining. We propose a framework that better aligns the pretraining and downstream stages. It consists of three simple yet key ingredients: (i) richer, semantics-based initial proposals derived from high-level feature maps, (ii) discriminative training using object pseudo-labels produced via clustering, (iii) self-training to take advantage of the improved object proposals learned by the detector. We report two main findings: (1) Our pretraining outperforms previous work in both the full- and low-data regimes by significant margins across detector architectures. (2) We show we can pretrain detectors from scratch (including the backbone) directly on complex image datasets like COCO, paving the way for unsupervised representation learning using object detection directly as a pretext task.
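Ingredient (ii) above can be illustrated with a minimal sketch: cluster pooled proposal features and use the cluster indices as pseudo-class labels for discriminative training. The function below is a hypothetical toy (plain k-means over an `(N, D)` feature array with a deterministic initialization); the paper's actual clustering pipeline and feature extraction are not specified here, so all names and parameters are illustrative assumptions.

```python
import numpy as np

def pseudo_label_proposals(features, num_classes=16, iters=10):
    """Assign pseudo-class labels to proposal features via k-means.

    Hypothetical sketch of the clustering-based pseudo-labeling idea:
    `features` is an (N, D) array of pooled region features; the
    returned integer labels can serve as pseudo-classes during
    unsupervised detector pretraining.
    """
    n = len(features)
    # Deterministic init: pick evenly spaced feature vectors as centers.
    idx = np.linspace(0, n - 1, num_classes).astype(int)
    centers = features[idx].astype(float)
    labels = np.zeros(n, dtype=int)
    for _ in range(iters):
        # Assign each proposal to its nearest cluster center.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        for k in range(num_classes):
            mask = labels == k
            if mask.any():
                centers[k] = features[mask].mean(axis=0)
    return labels

# Toy usage: two well-separated blobs of "proposal features"
# should receive two distinct pseudo-labels.
feats = np.concatenate([np.zeros((5, 8)), np.full((5, 8), 10.0)])
labels = pseudo_label_proposals(feats, num_classes=2)
```

In practice the number of clusters, the feature source (high-level backbone maps), and the clustering algorithm are design choices; the point is only that cluster assignments supply the class-aware signal that purely localization-based pretraining lacks.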
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5856