Improving Open-World Object Detection through Richer Instance Representation using Vision Foundation Models
Keywords: Open-World Object Detection, Visual Representation Learning, Visual Perception, Scene Understanding
TL;DR: We propose a method for training an object detector that can accurately detect unknown objects and extract semantically rich features in an open-world setting.
Abstract: While humans naturally identify novel objects and understand their relationships, deep learning-based object detectors struggle to detect and relate objects that are not observed during training. To overcome this issue, Open World Object Detection (OWOD) has been introduced to enable models to detect unknown objects in open-world scenarios. However, existing OWOD methods fail to capture the finer relationships between detected objects, which are crucial for comprehensive scene understanding and downstream applications. In this paper, we propose a method to train an object detector that can both detect novel objects and extract semantically rich features in open-world conditions by leveraging the knowledge of Vision Foundation Models (VFMs). We first utilize the semantic masks from the Segment Anything Model to supervise the box regression of unknown objects, ensuring accurate localization. By transferring the instance-wise similarities obtained from the VFM features to the detector's instance embeddings, our method then learns a semantically rich feature space for these embeddings. Extensive experiments show that our method learns a robust and generalizable feature space, additionally increasing the detector's applicability to tasks such as open-world tracking.
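To make the two supervision signals described in the abstract concrete, below is a minimal PyTorch-style sketch of how they could be implemented. This is an illustration under assumptions, not the authors' code: the function names `masks_to_boxes`, `box_supervision_loss`, and `similarity_transfer_loss` are hypothetical, and the specific choices of L1 and MSE losses are assumptions. Note that torchvision ships an equivalent mask-to-box conversion as `torchvision.ops.masks_to_boxes`.

```python
import torch
import torch.nn.functional as F


def masks_to_boxes(masks: torch.Tensor) -> torch.Tensor:
    """Convert binary instance masks (N, H, W) to xyxy boxes (N, 4).

    Hypothetical helper illustrating how box-regression targets for
    unknown objects could be derived from Segment Anything masks.
    Assumes every mask contains at least one foreground pixel.
    """
    boxes = torch.zeros(masks.shape[0], 4, device=masks.device)
    for i, m in enumerate(masks):
        ys, xs = torch.where(m > 0)
        boxes[i] = torch.stack([xs.min(), ys.min(), xs.max(), ys.max()]).float()
    return boxes


def box_supervision_loss(pred_boxes: torch.Tensor,
                         sam_masks: torch.Tensor) -> torch.Tensor:
    """Regress predicted boxes of unknown objects toward mask-derived boxes.

    L1 is an assumed choice; any standard box-regression loss would fit.
    """
    target_boxes = masks_to_boxes(sam_masks)
    return F.l1_loss(pred_boxes, target_boxes)


def similarity_transfer_loss(det_embed: torch.Tensor,
                             vfm_embed: torch.Tensor) -> torch.Tensor:
    """Match the detector's instance-pair similarities to the VFM's.

    det_embed: (N, D) instance embeddings from the detector head.
    vfm_embed: (N, D') features for the same N instances from a frozen
    VFM, pooled over each instance region (an assumed setup).
    """
    d = F.normalize(det_embed, dim=-1)
    v = F.normalize(vfm_embed, dim=-1)
    # Penalize the gap between the two pairwise cosine-similarity matrices.
    return F.mse_loss(d @ d.T, v @ v.T)
```

In a full training pipeline, these two terms would be added, with suitable weights, to the detector's standard classification and box-regression objectives.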
Submission Number: 4