Beyond Open-World: COSRA, a Training-Free Self-Refining Approach to Open-Ended Object Detection

Submitted to ICLR 2026 · 17 Sept 2025 (modified: 11 Feb 2026) · CC BY 4.0
Keywords: Open-Ended Object Detection, Open-World Object Detection, Large Language Models, Context-Aware Annotation
TL;DR: COSRA is a training-free, context-aware, self-refining method for open-ended object labeling. LLM-proposed labels + visual embeddings (ELR) with KNN/frequency voting and cross-modal re-ranking label novel objects; SOTA on COCO/LVIS.
Abstract: Traditional object detection models rely on predefined categories, which limits their ability to recognize unseen objects in open-world settings. Recent efforts in open-world and open-ended detection have begun to address this challenge by enabling models to go beyond closed-set assumptions. However, these approaches often remain limited in scalability, adaptability, or generalization to diverse environments. To overcome these limitations, we introduce the Context-oriented Open-ended Self-Refining Annotation model (COSRA), a training-free framework that combines context-aware reasoning with self-learning for open-ended object labeling. COSRA leverages Large Language Models (LLMs) to generate candidate labels for unknown objects based on contextual cues from known entities within a scene. COSRA then pairs these labels with visual embeddings to construct an Embedding-Label Repository (ELR), enabling inference without category supervision. To further enhance consistency, we introduce a self-refinement loop that re-evaluates repository labels using visual cohesion analysis and KNN-based majority relabeling. For a newly encountered unknown object, COSRA retrieves visually similar instances from the ELR and applies frequency-based voting and cross-modal re-ranking to assign a robust label. Our experimental results on the COCO and LVIS datasets demonstrate that COSRA outperforms state-of-the-art methods and effectively annotates novel categories using only visual and contextual signals, without requiring any fine-tuning or retraining.
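The retrieval-and-voting step described in the abstract can be illustrated with a minimal sketch. The function names, repository layout, and similarity metric (cosine) below are illustrative assumptions, not the paper's actual implementation: a query embedding is compared against stored embedding-label pairs, the k nearest entries are retrieved, and the most frequent label among them wins.

```python
import numpy as np
from collections import Counter

def knn_majority_label(query, repo_embeddings, repo_labels, k=3):
    """Assign a label to a query embedding by KNN retrieval from an
    embedding-label repository plus frequency-based voting (a sketch;
    the actual COSRA pipeline also applies cross-modal re-ranking)."""
    # Normalize so the dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    R = repo_embeddings / np.linalg.norm(repo_embeddings, axis=1, keepdims=True)
    sims = R @ q
    top_k = np.argsort(sims)[::-1][:k]          # indices of k most similar entries
    votes = Counter(repo_labels[i] for i in top_k)
    return votes.most_common(1)[0][0]            # majority label among neighbors

# Toy repository: three "cat"-like and two "dog"-like embeddings
repo = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                 [0.0, 1.0], [0.1, 0.9]])
labels = ["cat", "cat", "cat", "dog", "dog"]
print(knn_majority_label(np.array([1.0, 0.05]), repo, labels, k=3))  # cat
```

The same routine could serve the self-refinement loop as well: re-running it over each stored entry (excluding the entry itself) and relabeling by the neighborhood majority would implement the KNN-based majority relabeling the abstract mentions.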
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9591