OVRD: Open-Vocabulary Relation DINO with Text-guided Salient Query Selection

ICLR 2026 Conference Submission 16625 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multi-modal Learning, Open-Vocabulary, Object Detection
TL;DR: An open-vocabulary object detection model that explores relation modeling in open-vocabulary scenarios and enhances multi-modal fusion through text-guided salient query selection.
Abstract: Open-Vocabulary Detection (OVD) trains on base categories and generalizes to novel categories with the aid of text embeddings from Vision-Language Models (VLMs). However, existing methods insufficiently exploit the semantic cues in these text embeddings to guide visual perception, which limits zero-shot detection performance. In this paper, we propose OVRD, an Open-Vocabulary Relation DINO with text-guided salient query selection. Specifically, we introduce text-guided salient query selection to choose the image features most relevant to the text embeddings, together with their corresponding reference points and masks, thereby providing additional semantic cues to guide visual perception. Building on this, the salient reference points are used to recover the relative spatial structure of the selected features, enhancing positional awareness in the salient transformer decoder. Moreover, to fully exploit both the semantic cues and the recovered spatial structure, we develop a semantic-relation self-attention module that models the sparse semantic relations of OVD scenarios and further guides visual perception. We evaluate OVRD on public benchmarks in a zero-shot setting, achieving 37.0 AP on LVIS Minival and performing favorably against state-of-the-art methods. The code is available at https://anonymous.4open.science/r/OVRD.
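To make the selection step described above concrete, below is a minimal PyTorch sketch of text-guided salient query selection: score each image token by its similarity to the category text embeddings, then keep the top-k tokens and their reference points. All tensor shapes, names, and the top-k size are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def salient_query_selection(image_feats, text_embeds, reference_points, k=300):
    """Select the k image tokens most similar to any text embedding (sketch).

    image_feats:      (B, N, D) flattened encoder features (assumed layout)
    text_embeds:      (C, D)    VLM text embeddings for the category names
    reference_points: (B, N, 4) normalized boxes tied to each token (assumed)
    """
    # Cosine similarity between every image token and every text embedding.
    img = F.normalize(image_feats, dim=-1)            # (B, N, D)
    txt = F.normalize(text_embeds, dim=-1)            # (C, D)
    sim = torch.einsum("bnd,cd->bnc", img, txt)       # (B, N, C)

    # Saliency of a token = its best match over all categories.
    saliency, _ = sim.max(dim=-1)                     # (B, N)

    # Keep the top-k salient tokens and their reference points.
    topk_idx = saliency.topk(k, dim=1).indices        # (B, k)
    gather_feat = topk_idx.unsqueeze(-1).expand(-1, -1, image_feats.size(-1))
    gather_ref = topk_idx.unsqueeze(-1).expand(-1, -1, reference_points.size(-1))
    salient_feats = image_feats.gather(1, gather_feat)     # (B, k, D)
    salient_refs = reference_points.gather(1, gather_ref)  # (B, k, 4)
    return salient_feats, salient_refs, topk_idx
```

In the paper, the retained reference points are then used to recover the relative spatial structure of the selected features inside the salient transformer decoder; the sketch stops at the selection itself.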
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16625