Abstract: Open World Object Detection (OWOD) requires the detector to continuously identify and learn new categories. Existing methods rely on large language models (LLMs) to describe the visual attributes of known categories and use these attributes to mark potential objects. The performance of such methods is influenced by the accuracy of the LLM descriptions, and selecting appropriate attributes during incremental learning remains a challenge. In this paper, we propose a novel OWOD framework, termed OW-VAP, which operates independently of LLMs and requires only minimal object descriptions to detect unknown objects. Specifically, we propose a Visual Attribute Parser (VAP) that parses the attributes of visual regions and assesses object potential based on the similarity between these attributes and the object descriptions. To enable the VAP to recognize objects in unlabeled areas, we exploit potential objects within background regions. Finally, we propose Probabilistic Soft Label Assignment (PSLA) to prevent the optimization conflicts that arise when background is misidentified as foreground. Comparative results on the OWOD benchmark demonstrate that our approach surpasses existing state-of-the-art methods in unknown-object detection, with a +13 improvement in U-Recall and a +8 increase in U-AP. Furthermore, OW-VAP approaches the unknown recall upper limit of the detector.
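As a rough illustration of the similarity-based scoring the abstract describes, the sketch below scores a region by comparing its attribute embedding with a small set of object-description embeddings. This is not the authors' implementation: the abstract does not specify the similarity measure or how scores are aggregated, so the cosine similarity, the max over descriptions, the rescaling to [0, 1], and the names `cosine` and `object_potential` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def object_potential(region_attr, description_embs):
    """Hypothetical objectness score for one visual region.

    Assumption: the score is the best cosine similarity between the region's
    attribute embedding and any object-description embedding, rescaled from
    [-1, 1] to [0, 1].
    """
    best = max(cosine(region_attr, d) for d in description_embs)
    return 0.5 * (best + 1.0)


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    descriptions = [[1.0, 0.0], [0.0, 1.0]]
    print(object_potential([1.0, 0.0], descriptions))
    print(object_potential([-1.0, 0.0], [[1.0, 0.0]]))
```

Under these assumptions, a region whose attributes align with any description scores near 1, while a region pointing away from every description scores near 0.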
Lay Summary: We observed that existing approaches rely heavily on the attribute prediction accuracy of large language models (LLMs). In this paper, we propose an attribute parser that extracts coarse-grained attributes directly from visual regions, rather than relying on fixed, fine-grained attributes. To effectively train the attribute parser, we introduce a probabilistic modeling approach with soft labels. Our evaluation on the OWOD benchmark shows that the proposed method significantly outperforms previous approaches and approaches, or even surpasses, the generalization upper bound of attribute detectors.
Primary Area: Deep Learning
Keywords: Object Detection, Open World Object Detection, Open Vocabulary Object Detection
Submission Number: 3791