LOCAL-DETR: Localized Open-vocabulary Contrastive Alignment via Semantic Caching

ACL ARR 2025 May Submission6344 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Open-vocabulary object detection (OVOD) aims to detect novel categories beyond the closed-world setting. Existing methods often rely heavily on multimodal fusion, which leads to inefficiency and poor generalization. Inspired by hierarchical visual perception in cognitive science, we propose LOCAL-DETR, a purely vision-based framework featuring a Dynamically Hierarchical Semantic Prototype Repository (DHSAP) that mines semantic anchors from DETR’s multi-scale attention maps in a self-supervised manner. To balance base-class detection and open-vocabulary generalization, we introduce a Dual-stream Decoupled Training Paradigm consisting of two phases: a base phase for robust detection via spatial-semantic co-modeling, and an open phase for semantic alignment via contrastive prototype distillation. Experiments demonstrate that LOCAL-DETR effectively balances accuracy and generalization, providing an efficient approach to open-set object perception.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality,cross-modal pretraining,image text matching
Contribution Types: Approaches to low-resource settings, Theory
Languages Studied: English
Submission Number: 6344