Abstract: Open-vocabulary object detection (OVD) has been studied with Vision-Language Models (VLMs) to detect novel objects beyond the pre-trained categories. Previous approaches improve generalization by expanding the detector's knowledge, using ‘positive’ pseudo-labels with additional ‘class’ names, e.g., sock, iPod, and alligator. To extend the previous methods in two aspects, we propose Retrieval-Augmented Losses and visual Features (RALF). Our method retrieves related ‘negative’ classes and augments the loss functions. In addition, visual features are augmented with ‘verbalized concepts’ of classes, e.g., worn on the feet, handheld music player, and sharp teeth. Specifically, RALF consists of two modules: Retrieval-Augmented Losses (RAL) and Retrieval-Augmented visual Features (RAF). RAL consists of two losses that reflect the semantic similarity with negative vocabularies. RAF augments visual features with the verbalized concepts from a large language model (LLM). Our experiments demonstrate the effectiveness of RALF on the COCO and LVIS benchmark datasets. We achieve improvements of up to 3.4 box AP$^{N}_{50}$ on novel categories of the COCO dataset and 3.6 mask AP$_{r}$ on the LVIS dataset. Code is available at https://github.com/mlvlab/RALF.