Abstract: Current approaches to open-set object detection rely heavily on vision-language fusion paradigms, yet this methodology faces an inherent challenge: many objects are difficult to describe accurately through language alone. While recent research has attempted to incorporate visual information to address this limitation, existing models still struggle with fine-grained object discrimination. To address this, we introduce VINO (Visual Intersection Network for OSOD), a novel DETR-based pure-vision model that constructs a multi-image visual bank to preserve semantic intersections across categories and fuses category and region semantics through a multi-stage mechanism. Furthermore, we implement a simple replacement strategy to ensure the model learns alignment capabilities rather than semantic approximation. Consuming only 0.84M training images, VINO achieves performance competitive with vision-language models on benchmarks such as LVIS and ODinW35. Additionally, the successful integration of a segmentation head demonstrates the broad applicability of visual intersection across various visual tasks.
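The abstract's core idea of a multi-image visual bank whose entries are collapsed into a category embedding and matched against region features can be sketched as follows. This is a minimal illustration only: the element-wise minimum as the "semantic intersection" operator, the cosine-similarity region scoring, and all function names are hypothetical stand-ins, not the paper's actual architecture.

```python
import numpy as np

def category_prototype(bank: np.ndarray) -> np.ndarray:
    """Collapse a visual bank of K support-image embeddings (K x D) into a
    single category embedding. The element-wise minimum is a hypothetical
    stand-in for the paper's 'semantic intersection' across images."""
    return bank.min(axis=0)

def score_regions(regions: np.ndarray, prototype: np.ndarray) -> np.ndarray:
    """Score N candidate region embeddings (N x D) against a category
    prototype via cosine similarity (illustrative fusion, not VINO's
    multi-stage mechanism)."""
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    p = prototype / np.linalg.norm(prototype)
    return r @ p

rng = np.random.default_rng(0)
bank = rng.random((4, 8))       # 4 support images, 8-dim embeddings
proto = category_prototype(bank)
regions = rng.random((3, 8))    # 3 candidate region embeddings
scores = score_regions(regions, proto)
print(scores.shape)             # one score per region: (3,)
```

In this sketch, keeping several embeddings per category and intersecting them retains only features shared across the support images, which is one plausible reading of how a visual bank could suppress instance-specific noise.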
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 561