Enhancing open-vocabulary object detection through region-word and region-vision matching

Published: 01 Jan 2025, Last Modified: 30 Jul 2025. Multim. Syst. 2025. License: CC BY-SA 4.0
Abstract: Open-vocabulary object detection (OVOD) aims to detect novel object categories beyond the training set. Existing OVOD methods have made encouraging progress by leveraging large-scale image-caption pairs and pre-trained vision-language models (VLMs). However, two main limitations remain: (1) the potential category-specific concepts in global captions are not fully utilized, leaving the detector without fine-grained semantic guidance; (2) the compositional structure of multiple concepts that naturally exists in image-caption pairs, as represented by VLMs, remains insufficiently explored, limiting the model’s ability to generalize to novel category concepts. To address these limitations, we propose a novel framework called Region-Word-Vision Matching (RWVM) that integrates two core modules: a Region-Word Matching (RWM) module and a Region-Vision Matching (RVM) module. Our key insight is to simultaneously align textual and visual knowledge with region features to strengthen the model’s understanding of complex visual scenes. Specifically, the RWM module guides fine-grained semantic aggregation by fusing local region-word matching with global image-caption matching. The RVM module leverages VLMs to capture the compositional structure of single and multiple object concepts, directly enhancing the detector’s ability to learn novel category concepts. Additionally, we demonstrate that the RVM module, using only simplified region embeddings, outperforms embeddings extracted from full language models. Extensive experiments show that our model achieves superior performance compared with other OVOD methods, improving the average precision (AP) on novel categories of the COCO and LVIS datasets.
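To make the region-word matching idea concrete, here is a minimal sketch of how local region-word similarities are commonly computed and fused into a global image-caption score in the OVOD literature. The shapes, temperature, and max-over-regions pooling are illustrative assumptions, not the paper's exact RWM formulation:

```python
import torch
import torch.nn.functional as F

def region_word_similarity(region_feats, word_embeds, temperature=0.01):
    # region_feats: (R, D) features for R candidate regions
    # word_embeds:  (W, D) embeddings for the W caption words
    r = F.normalize(region_feats, dim=-1)  # unit-norm region features
    w = F.normalize(word_embeds, dim=-1)   # unit-norm word embeddings
    return r @ w.t() / temperature         # (R, W) scaled cosine similarities

def image_caption_score(sim):
    # Fuse local matches into a global image-caption score:
    # take the best-matching region for each word, then average over words
    # (a common pooling choice, not necessarily the paper's).
    return sim.max(dim=0).values.mean()

# Usage with random tensors standing in for detector / text-encoder outputs:
regions = torch.randn(32, 512)  # 32 region proposals, 512-d features
words = torch.randn(10, 512)    # 10 caption-word embeddings
sim = region_word_similarity(regions, words)
print(image_caption_score(sim))
```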