Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model

Published: 19 Jun 2022, Last Modified: 11 Nov 2025 | CVPR | CC BY 4.0
Abstract: Recently, vision-language pre-training has shown great potential in open-vocabulary object detection, where detectors trained on base classes are devised to detect novel classes. The class text embedding is first generated by feeding prompts to the text encoder of a pre-trained vision-language model. It is then used as the region classifier to supervise the training of a detector. The key element behind the success of this model is the proper prompt, which requires careful word tuning and ingenious design. To avoid laborious prompt engineering, prompt representation learning methods have been proposed for the image classification task; however, they can only be sub-optimal solutions when applied to the detection task. In this paper, we introduce a novel method, detection prompt (DetPro), to learn continuous prompt representations for open-vocabulary object detection based on the pre-trained vision-language model. Different from previous classification-oriented methods, DetPro has two highlights: 1) a background interpretation scheme to include the proposals in the image background in prompt training; 2) a context grading scheme to separate proposals in the image foreground for tailored prompt training. We assemble DetPro with ViLD, a recent state-of-the-art open-world object detector, and conduct experiments on LVIS as well as transfer-learning experiments on the Pascal VOC, COCO, and Objects365 datasets. Experimental results show that our DetPro outperforms the baseline ViLD [7] in all settings, e.g., +3.4 APbox and +3.0 APmask improvements on the novel classes of LVIS. Code and models are available at https://github.com/dyabel/detpro
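To make the mechanism concrete, below is a minimal PyTorch sketch of continuous prompt learning for region classification in the spirit described above: learnable context vectors are prepended to frozen class-name token embeddings, passed through a frozen text encoder to produce class embeddings, and those embeddings classify region proposals. The tiny TextEncoder, the dimensions, the temperature, and the uniform-distribution loss for background proposals are all stand-ins and assumptions for illustration, not the paper's actual architecture or objective; in practice the frozen text encoder would come from a pre-trained vision-language model such as CLIP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Stand-in for a frozen pre-trained text encoder (hypothetical)."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_embeddings):          # (C, L, D)
        x = token_embeddings.mean(dim=1)          # pool over the sequence
        return F.normalize(self.proj(x), dim=-1)  # (C, D) unit vectors

class PromptLearner(nn.Module):
    """Learnable context vectors shared across classes, prepended to
    frozen class-name token embeddings: [V1]...[Vm] [CLASS]."""
    def __init__(self, class_name_embeds, n_ctx=8, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        self.register_buffer("cls", class_name_embeds)  # (C, L_cls, D)

    def forward(self):
        C = self.cls.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(C, -1, -1)   # (C, n_ctx, D)
        return torch.cat([ctx, self.cls], dim=1)        # (C, n_ctx+L_cls, D)

# Hypothetical setup: 3 base classes with pre-embedded name tokens.
C, L_cls, D, tau = 3, 4, 512, 0.01
text_encoder = TextEncoder(D).eval()
for p in text_encoder.parameters():
    p.requires_grad_(False)                        # encoder stays frozen
prompts = PromptLearner(torch.randn(C, L_cls, D), n_ctx=8, dim=D)
opt = torch.optim.SGD(prompts.parameters(), lr=0.002)

# One training step on region embeddings from the frozen image branch.
region_emb = F.normalize(torch.randn(16, D), dim=-1)   # (N, D) proposals
labels = torch.randint(0, C + 1, (16,))                # label C = background

class_emb = text_encoder(prompts())                    # (C, D)
logits = region_emb @ class_emb.t() / tau              # (N, C)

fg = labels < C
loss = logits.new_zeros(())
if fg.any():                                           # foreground: usual CE
    loss = loss + F.cross_entropy(logits[fg], labels[fg])
if (~fg).any():
    # Background handling (assumed sketch): push background proposals
    # toward a uniform distribution over classes, i.e. cross-entropy
    # against a uniform target, so they are "equally unlike" every class.
    log_p = F.log_softmax(logits[~fg], dim=-1)
    loss = loss + (-log_p.mean())
opt.zero_grad(); loss.backward(); opt.step()
```

Only the context vectors receive gradients; the text encoder and the region embeddings stay fixed, which is what makes the learned prompt transferable across detectors and datasets.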