LOVD: Large Open Vocabulary Object Detection

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract:

Existing open-vocabulary object detectors require an accurate and compact vocabulary to be pre-defined at inference time. Their performance degrades substantially in real scenarios, where the underlying vocabulary may be indeterminate and often exponentially large. To gain a more comprehensive understanding of this phenomenon, we propose a new setting called Large-and-Open Vocabulary object Detection, which simulates real scenarios by testing detectors with large vocabularies containing thousands of unseen categories. The vast number of unseen categories inevitably increases the number of category distractors, severely impeding the recognition process and leading to unsatisfactory detection results. To address this challenge, we propose a Large and Open Vocabulary Detector (LOVD) with two core components, termed the Image-to-Region Filtering (IRF) module and the Cross-View Verification (CV$^2$) scheme. To mitigate the category distractors introduced by large vocabularies, IRF first performs image-level recognition to build a compact vocabulary, relevant to the image scene, out of the large input vocabulary, and then performs region-level classification over this compact vocabulary. CV$^2$ further enhances IRF by conducting image-to-region filtering in both global and local views and produces the final detection categories through a multi-branch voting mechanism. Compared to prior works, our LOVD is more scalable and robust to large input vocabularies, and can be seamlessly integrated with predominant detection methods to improve their open-vocabulary performance. Source code will be made publicly available.
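The two-stage idea behind IRF and the CV$^2$ voting step can be sketched as follows. This is our own minimal illustration, not the authors' released code: it assumes CLIP-style image, region, and category-text embeddings are already available, and it stands in for the paper's actual filtering and verification modules with simple cosine-similarity top-k selection and majority voting.

```python
import numpy as np
from collections import Counter

def l2_normalize(x, axis=-1):
    """Normalize embeddings so that dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def image_to_region_filter(image_emb, text_embs, vocab, k):
    """IRF, stage 1 (illustrative): image-level recognition keeps the k
    categories most similar to the whole-image embedding, forming a
    compact vocabulary out of the large input vocabulary."""
    sims = l2_normalize(text_embs) @ l2_normalize(image_emb)
    keep = np.argsort(-sims)[:k]
    return [vocab[i] for i in keep], text_embs[keep]

def classify_regions(region_embs, compact_embs, compact_vocab):
    """IRF, stage 2 (illustrative): region-level classification restricted
    to the compact vocabulary, avoiding distractors from unseen categories."""
    sims = l2_normalize(region_embs) @ l2_normalize(compact_embs).T
    return [compact_vocab[i] for i in sims.argmax(axis=1)]

def cross_view_vote(branch_preds):
    """CV^2-style voting (illustrative): merge per-region labels from
    multiple branches (e.g. global and local views) by majority vote."""
    return [Counter(p).most_common(1)[0][0] for p in zip(*branch_preds)]
```

A toy usage, with identity text embeddings so each category occupies one axis: filtering `["cat", "dog", "car", "tree", "boat"]` against an image embedding of `[0.9, 0.8, 0.1, 0, 0]` with `k=2` keeps `["cat", "dog"]`, and regions are then classified only against those two categories. The real LOVD components are learned and view-dependent; this sketch only conveys the filter-then-classify structure.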

Primary Subject Area: [Content] Vision and Language
Relevance To Conference: This work delves into the intricacies of multimodal processing within the multimedia domain, starting from the foundation of open-vocabulary detection (OVD), which inherently combines visual and textual modalities for object recognition. Building on this multimodal integration, we introduce a new task, large and open-vocabulary detection, which extends OVD by challenging systems to accurately recognize and localize a broader array of object categories, including those unseen during training. The development of the Large and Open Vocabulary Detector is a targeted solution to this task, aiming to improve how visual and textual modalities interact within multimedia systems. By incorporating the Image-to-Region Filtering module and Cross-View Verification scheme, our approach facilitates a more nuanced processing of visual scenes alongside extensive textual vocabularies. This method highlights the necessity of advanced multimodal processing strategies in managing the complexities introduced by large vocabularies, offering a scalable and versatile solution for multimedia applications that demand a sophisticated, context-aware understanding of both visual and textual data. Through this, we underscore the pivotal role of multimodal integration in enhancing the analytical capabilities of multimedia systems.
Supplementary Material: zip
Submission Number: 935
