Abstract: Existing studies typically investigate domain shift and category shift as independent problems. In real-world scenarios, however, the two types of shift often occur simultaneously and interact, leading to significant degradation in detection performance.
To address this, we propose and systematically study a novel problem—Open-Domain Open-Vocabulary (ODOV) object detection—which aims to evaluate a model’s ability to adapt to the compound domain and category shifts in real-world environments.
We construct a new benchmark, OD-LVIS, which contains 46,949 images spanning 15 diverse real-world scenarios and 1,203 categories, for assessing object detection performance.
Furthermore, we propose a novel ODOV detection baseline that leverages the powerful multi-modal alignment capabilities of vision-language models (VLMs) and introduces two key mechanisms to enhance both category and domain generalization. One is the Domain-Agnostic Category Prompt (DAPmt), which strengthens category semantics while attenuating domain representations, yielding a purer category representation.
The other is the Domain Projection and Grafting (DP&G) module, which incorporates domain-specific features from input images, allowing the model to dynamically generalize across diverse open domains.
These two components enable the model to maintain effective detection performance under simultaneous category and domain variations in real-world scenarios.
We provide extensive benchmark evaluations for the proposed ODOV detection task. The results validate the soundness of the ODOV task, the practicality of the OD-LVIS dataset, and the superiority of the proposed method.