Abstract: Open-Vocabulary Object Detection (OVD) has been advanced by enhancing the regional representation capability of Contrastive Language-Image Pre-training (CLIP). However, because CLIP is trained on image-text pairs that lack precise object location information, merely enhancing the model's regional representation offers limited improvement in accurately locating novel classes. To address this shortfall, we employ Image Masking (IM) through uniform, non-repetitive random sampling to remove a large number of redundant and distracting image blocks, enabling the model to better perceive the semantic information of novel classes. Furthermore, we design an Improved Detailed Feature (IDF) extraction network, which uses original image features in a new branch to supplement detail information and to reconstruct multi-scale features. This mitigates the loss of detail and the drop in base-class accuracy caused by discarding a large number of image blocks, and further enhances detection of novel classes. Our method achieves 36.2% mAP_r on OV-LVIS and 45.5% AP50 on novel classes of OV-COCO. Extensive experiments demonstrate that our method outperforms competing methods on the OVD task.
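To make the masking step concrete, the sketch below illustrates uniform, non-repetitive random sampling of image blocks in the spirit described above. It is an illustrative sketch only: the function name, patch size, and mask ratio are assumptions, not the paper's exact implementation or settings.

```python
# Illustrative sketch of uniform, non-repetitive random image-block masking.
# patch_size and mask_ratio are assumed values, not the paper's settings.
import torch


def random_patch_mask(images: torch.Tensor, patch_size: int = 16, mask_ratio: float = 0.5):
    """Drop a fixed fraction of non-overlapping image blocks, sampled
    uniformly without replacement (each block is removed at most once)."""
    b, c, h, w = images.shape
    ph, pw = h // patch_size, w // patch_size
    num_patches = ph * pw
    num_keep = int(num_patches * (1.0 - mask_ratio))

    # torch.randperm yields a uniform, non-repetitive permutation of block indices.
    keep_ids = torch.stack([torch.randperm(num_patches)[:num_keep] for _ in range(b)])

    # Patchify: (B, C, H, W) -> (B, num_patches, C * patch_size * patch_size)
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, num_patches, -1)

    # Keep only the sampled (visible) blocks; the rest are discarded.
    visible = torch.gather(
        patches, 1, keep_ids.unsqueeze(-1).expand(-1, -1, patches.shape[-1])
    )
    return visible, keep_ids


# Example: mask half of the 16x16 blocks of a batch of 224x224 images.
imgs = torch.randn(2, 3, 224, 224)
visible_patches, kept = random_patch_mask(imgs, patch_size=16, mask_ratio=0.5)
print(visible_patches.shape)  # torch.Size([2, 98, 768])
```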