Beyond Open-World: COSRA, a Training-Free Self-Refining Approach to Open-Ended Object Detection

Submitted to ICLR 2026 · 17 Sept 2025 (modified: 11 Feb 2026) · CC BY 4.0
Keywords: Open-Ended Object Detection, Open-World Object Detection, Large Language Models, Context-Aware Annotation
TL;DR: COSRA is a training-free, context-aware, self-refining method for open-ended object labeling. LLM-proposed labels + visual embeddings (ELR) with KNN/frequency voting and cross-modal re-ranking label novel objects; SOTA on COCO/LVIS.
Abstract: Traditional object detection models rely on predefined categories, which limits their ability to recognize unseen objects in open-world settings. Recent efforts in open-world and open-ended detection have begun to address this challenge by enabling models to go beyond closed-set assumptions. However, these approaches often remain limited in scalability, adaptability, or generalization to diverse environments. To overcome these limitations, we introduce the Context-oriented Open-ended Self-Refining Annotation model (COSRA), a training-free framework that combines context-aware reasoning with self-learning for open-ended object labeling. COSRA leverages Large Language Models (LLMs) to generate candidate labels for unknown objects based on contextual cues from known entities within a scene. COSRA then pairs these labels with visual embeddings to construct an Embedding-Label Repository (ELR), enabling inference without category supervision. To further enhance consistency, we introduce a self-refinement loop that re-evaluates repository labels using visual cohesion analysis and KNN-based majority relabeling. For a newly encountered unknown object, COSRA retrieves visually similar instances from the ELR and applies frequency-based voting and cross-modal re-ranking to assign a robust label. Our experimental results on the COCO and LVIS datasets demonstrate that COSRA outperforms state-of-the-art methods and effectively annotates novel categories using only visual and contextual signals, without requiring any fine-tuning or retraining.
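The retrieval-and-voting step described in the abstract can be illustrated with a minimal sketch. The function names, repository layout, and similarity metric (cosine) below are illustrative assumptions, not the paper's actual implementation: a query embedding is compared against stored embedding-label pairs, the k nearest entries are retrieved, and the most frequent label among them wins.

```python
import numpy as np
from collections import Counter

def knn_majority_label(query, repo_embeddings, repo_labels, k=3):
    """Assign a label to a query embedding by KNN retrieval from an
    embedding-label repository plus frequency-based voting (a sketch;
    the actual COSRA pipeline also applies cross-modal re-ranking)."""
    # Normalize so the dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    R = repo_embeddings / np.linalg.norm(repo_embeddings, axis=1, keepdims=True)
    sims = R @ q
    top_k = np.argsort(sims)[::-1][:k]          # indices of k most similar entries
    votes = Counter(repo_labels[i] for i in top_k)
    return votes.most_common(1)[0][0]            # majority label among neighbors

# Toy repository: three "cat"-like and two "dog"-like embeddings
repo = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                 [0.0, 1.0], [0.1, 0.9]])
labels = ["cat", "cat", "cat", "dog", "dog"]
print(knn_majority_label(np.array([1.0, 0.05]), repo, labels, k=3))  # cat
```

The same routine could serve the self-refinement loop as well: re-running it over each stored entry (excluding the entry itself) and relabeling by the neighborhood majority would implement the KNN-based majority relabeling the abstract mentions.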
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9591