Abstract: Self-supervised pre-training and transformer-based networks have significantly improved the performance of object detection. However, most current self-supervised object detection methods are built on convolutional architectures. We believe that the sequence characteristics of transformers should be taken into account when designing a transformer-based self-supervised method for object detection. To this end, we propose a novel transformer-based self-supervised learning method for object detection that enforces sequence consistency between an online branch and a momentum branch. The proposed method minimizes the discrepancy between the output sequences the transformer produces for different views of the same image, and leverages bipartite matching to find the most relevant sequence pairs, improving sequence-level self-supervised representation learning. Our method achieves state-of-the-art results on MS COCO (45.6 AP) and PASCAL VOC (63.9 AP), demonstrating the effectiveness of our approach.
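A minimal sketch of the core idea described above, assuming DETR-style query sequences and cosine similarity as the discrepancy measure; the function and variable names are illustrative, not the authors' implementation:

```python
# Sequence-level consistency between an online and a momentum branch,
# with bipartite (Hungarian) matching to pair up sequence elements.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def sequence_consistency_loss(online_seq, momentum_seq):
    """online_seq, momentum_seq: (N, D) output sequences produced by the
    transformer from two augmented views of the same image."""
    online = F.normalize(online_seq, dim=-1)
    momentum = F.normalize(momentum_seq, dim=-1)

    # Pairwise cost: 1 - cosine similarity between every online/momentum element.
    cost = 1.0 - online @ momentum.t()  # (N, N)

    # Bipartite matching selects the most relevant sequence pairs.
    row_idx, col_idx = linear_sum_assignment(cost.detach().cpu().numpy())

    # Minimize the discrepancy of matched pairs; the momentum branch is
    # assumed to be EMA-updated and serves as a stop-gradient target.
    matched_sim = (online[row_idx] * momentum[col_idx].detach()).sum(dim=-1)
    return (1.0 - matched_sim).mean()
```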