Open-Det: An Efficient Learning Framework for Open-Ended Detection

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Open-Ended Detection
Abstract: Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, without requiring additional vocabularies during inference. However, existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation processes by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between the Vision and Language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, together with a Prompts Distiller that transfers knowledge from the VLM into VL-prompts, enabling accurate object name generation by the LLM. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det achieves even higher performance (+1.0% in APr) while using only 1.5% of the training data (0.077M vs. 5.077M images), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100s vs. 16 A100s). The source code is available at: https://github.com/Med-Process/Open-Det.
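To make the idea of "eliminating contradictory supervision" concrete, here is a minimal sketch of a masked contrastive alignment loss. Everything below (function name, shapes, and the masking rule of excluding same-name pairs from the negatives) is our own illustrative assumption, not the paper's actual implementation: the intuition is that two detected objects sharing the same ground-truth name should not push each other apart as contrastive negatives.

```python
import numpy as np

def masked_alignment_loss(vision_emb, text_emb, labels, temperature=0.1):
    """Illustrative contrastive vision-to-language alignment with a mask.

    Pairs that share the same ground-truth name (labels[i] == labels[j],
    i != j) are removed from the negatives, so objects of the same class
    never supervise each other as mismatches. Hypothetical sketch only.
    """
    # Normalize embeddings so dot products become cosine similarities.
    v = vision_emb / np.linalg.norm(vision_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature                   # (N, N) similarities

    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]        # same-name pairs
    # Keep the diagonal (each object's own text is its positive);
    # mask out off-diagonal same-name pairs as contradictory negatives.
    mask = same & ~np.eye(len(labels), dtype=bool)
    logits = np.where(mask, -np.inf, logits)

    # Softmax cross-entropy with the diagonal as the target.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

In this sketch, masking duplicate-name pairs shrinks each row's softmax denominator, so the loss no longer penalizes a region for being similar to another object's identical name.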
Lay Summary: How can we detect novel objects in open-world scenes and name them freely, without predefined category lists? This capability, known as Open-Ended Object Detection, has practical value but remains challenging. Existing approaches like GenerateU suffer from three critical limitations: (1) dependence on massive training data (millions of images), (2) slow convergence (requiring hundreds of training epochs), and (3) suboptimal detection accuracy. We propose Open-Det, a novel and efficient learning framework that accelerates model training in both visual localization and open-vocabulary name generation. The modality gap between vision and language is mitigated by Vision-to-Language and Language-to-Vision alignment and by knowledge transfer from the VLM. Additionally, we optimize the training process by modeling inter-object relations, achieving significantly improved data efficiency and accuracy. Our work demonstrates that predefined category priors are unnecessary for detecting novel categories in open-world settings, enabling practical and flexible vocabulary-free open-world detection. These advances hold significant potential for real-world applications, such as autonomous driving and security, while also advancing research in Open-Ended Detection.
Primary Area: Deep Learning
Keywords: Open-Ended Detection; Framework; Prompts Distillation; Loss Functions
Submission Number: 1421