Keywords: Open-vocabulary object detection, Region-text alignment, Attentive masking
Abstract: Open-vocabulary object detection (OVDet) aims to detect novel categories based on textual descriptions, allowing models to generalize beyond the categories seen during training. However, robust open-vocabulary detection requires both aligning text descriptions with specific image regions and capturing spatial relationships between related regions. Most existing methods focus on aligning regions with categorical labels and often overlook interactions between neighboring regions, which limits their ability to form a precise correspondence between text descriptions and image content. We propose AlignDet, which incorporates an attentive masking strategy to address these challenges. By masking irrelevant regions in the image, our model focuses on the most relevant areas for each text concept, leading to fine-grained region-word correspondences. Additionally, our soft association strategy allows multiple regions to align with a single text concept, capturing spatial relationships between neighboring or related regions of the image more effectively. Extensive experiments demonstrate that our model consistently surpasses existing methods across various benchmarks.
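The abstract does not give the exact formulation of attentive masking or soft association, so the following is only a minimal sketch of one way the two ideas could combine. It assumes a precomputed region-word similarity matrix, a top-k rule for deciding which regions are masked, and a temperature-scaled softmax over the surviving regions. The function name `soft_region_word_weights` and the parameters `keep` and `temperature` are hypothetical and not taken from AlignDet.

```python
import math


def soft_region_word_weights(sim, keep=2, temperature=0.1):
    """Illustrative sketch only, not the authors' method.

    sim[r][w] is an assumed similarity score between region r and word w.
    Returns weights[w][r], a per-word distribution over regions.
    """
    num_regions = len(sim)
    num_words = len(sim[0])
    weights = []
    for w in range(num_words):
        scores = [sim[r][w] for r in range(num_regions)]
        # Attentive masking (assumed form): keep only the `keep`
        # highest-scoring regions for this word and mask out the rest.
        kept = sorted(range(num_regions), key=lambda r: scores[r], reverse=True)[:keep]
        # Soft association: softmax over the unmasked regions, so several
        # regions can share the alignment with one word.
        top = max(scores[r] for r in kept)  # subtract max for numerical stability
        exps = {r: math.exp((scores[r] - top) / temperature) for r in kept}
        total = sum(exps.values())
        weights.append([exps[r] / total if r in exps else 0.0 for r in range(num_regions)])
    return weights


# Toy example: 3 regions, 2 words (values are made up for illustration).
sim = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
]
weights = soft_region_word_weights(sim, keep=2)
```

In this toy example the first word distributes its weight over the two regions that score highest for it, while the third region is masked to zero. A hard assignment would instead keep only the single best region, which is the behavior the soft association strategy is described as avoiding.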
Submission Number: 33