Abstract: Recently, vision-language pre-training has shown great potential in open-vocabulary object detection, where detectors trained on base classes are expected to detect novel classes. A class text embedding is first generated by feeding prompts to the text encoder of a pre-trained vision-language model; it is then used as the region classifier to supervise the training of a detector. The key element behind the success of this paradigm is a proper prompt, which requires careful word tuning and ingenious design. To avoid laborious prompt engineering, several prompt representation learning methods have been proposed for the image classification task, but they yield only sub-optimal solutions when applied to detection. In this paper, we introduce a novel method, detection prompt (DetPro), to learn continuous prompt representations for open-vocabulary object detection based on a pre-trained vision-language model. Different from previous classification-oriented methods, DetPro has two highlights: 1) a background interpretation scheme to include proposals from the image background in prompt training; 2) a context grading scheme to separate proposals in the image foreground for tailored prompt training. We assemble DetPro with ViLD, a recent state-of-the-art open-world object detector, and conduct experiments on LVIS as well as transfer learning on the Pascal VOC, COCO, and Objects365 datasets. Experimental results show that our DetPro outperforms the baseline ViLD [7] in all settings, e.g., with +3.4 APbox and +3.0 APmask improvements on the novel classes of LVIS. Code and models are available at https://github.com/dyabel/detpro