Abstract: Current approaches to open-set object detection rely heavily on vision-language fusion paradigms, yet this methodology faces an inherent challenge: many objects are difficult to describe accurately through language alone. While recent research has attempted to incorporate visual information to address this limitation, existing models still struggle with fine-grained object discrimination. To address this, we introduce VINO (Visual Intersection Network for OSOD), a novel DETR-based pure-vision model that constructs a multi-image visual bank to preserve semantic intersections across categories and fuses category and region semantics through a multi-stage mechanism. Furthermore, we implement a simple replacement strategy to ensure the model learns alignment capabilities rather than semantic approximation. Consuming only 0.84M training images, VINO achieves performance competitive with vision-language models on benchmarks such as LVIS and ODinW35. Additionally, the successful integration of a segmentation head demonstrates the broad applicability of visual intersection across various visual tasks.
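The abstract's core idea of a multi-image visual bank whose entries are collapsed into a category embedding and matched against region features can be sketched as follows. This is a minimal illustration only: the element-wise minimum as the "semantic intersection" operator, the cosine-similarity region scoring, and all function names are hypothetical stand-ins, not the paper's actual architecture.

```python
import numpy as np

def category_prototype(bank: np.ndarray) -> np.ndarray:
    """Collapse a visual bank of K support-image embeddings (K x D) into a
    single category embedding. The element-wise minimum is a hypothetical
    stand-in for the paper's 'semantic intersection' across images."""
    return bank.min(axis=0)

def score_regions(regions: np.ndarray, prototype: np.ndarray) -> np.ndarray:
    """Score N candidate region embeddings (N x D) against a category
    prototype via cosine similarity (illustrative fusion, not VINO's
    multi-stage mechanism)."""
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    p = prototype / np.linalg.norm(prototype)
    return r @ p

rng = np.random.default_rng(0)
bank = rng.random((4, 8))       # 4 support images, 8-dim embeddings
proto = category_prototype(bank)
regions = rng.random((3, 8))    # 3 candidate region embeddings
scores = score_regions(regions, proto)
print(scores.shape)             # one score per region: (3,)
```

In this sketch, keeping several embeddings per category and intersecting them retains only features shared across the support images, which is one plausible reading of how a visual bank could suppress instance-specific noise.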
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 561