Keywords: Open-vocabulary Object Detection; Object-level Vision-Language Pretraining
TL;DR: A novel open-vocabulary detection framework that learns region-word alignment from object co-occurrence across images.
Abstract: Deriving reliable region-word alignment from image-text pairs is critical to learn
object-level vision-language representations for open-vocabulary object detection.
Existing methods typically rely on pre-trained or self-trained vision-language
models for alignment, which are prone to limitations in localization accuracy or
generalization capabilities. In this paper, we propose CoDet, a novel approach
that overcomes the reliance on pre-aligned vision-language space by reformulating
region-word alignment as a co-occurring object discovery problem. Intuitively, by
grouping images that mention a shared concept in their captions, objects corresponding
to the shared concept should exhibit high co-occurrence within the group.
CoDet then leverages visual similarities to discover the co-occurring objects and
align them with the shared concept. Extensive experiments demonstrate that CoDet
achieves superior performance and compelling scalability in open-vocabulary detection,
e.g., by scaling up the visual backbone, CoDet achieves 37.0 $AP^m_{novel}$ and
44.7 $AP^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $AP^m_{novel}$
and 9.8 $AP^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.
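To make the co-occurring object discovery idea concrete, here is a minimal sketch (not CoDet's actual implementation; all function names, shapes, and the scoring rule are illustrative assumptions). Given region features for a group of images whose captions mention the same concept, each region is scored by how well it matches regions in the other images of the group, and the best-matching region per image is aligned with the shared concept.

```python
# Illustrative sketch of co-occurring object discovery, NOT the official
# CoDet code: for images grouped by a shared caption concept, select in
# each image the region most visually similar to regions in the other
# images; that region is treated as the instance of the shared concept.
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize feature vectors so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def discover_co_occurring_regions(region_feats):
    """region_feats: list of (num_regions_i, dim) arrays, one per image
    in the concept group. Returns the selected region index per image."""
    feats = [l2_normalize(f) for f in region_feats]
    picks = []
    for i, fi in enumerate(feats):
        per_image_best = []
        for j, fj in enumerate(feats):
            if j == i:
                continue
            # Best cosine match of each region in image i against image j.
            per_image_best.append((fi @ fj.T).max(axis=1))  # (n_i,)
        # Average the best-match scores over all other images in the group.
        score = np.mean(per_image_best, axis=0)
        picks.append(int(np.argmax(score)))
    return picks
```

In this toy scoring rule, a region that recurs across the group (the shared object) gets a high average cross-image similarity, while image-specific distractor regions do not; the real method additionally handles proposal generation and aligns the discovered regions with the concept word for detection training.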
Supplementary Material: pdf
Submission Number: 2038