Abstract: Autonomous driving (AD) is a typical application that requires effectively exploiting multimedia information. For AD, it is critical to ensure safety by detecting unknown objects in an open world, which drives the demand for open world object detection (OWOD). However, existing OWOD methods treat all generic objects beyond the known classes in the training set as unknown objects and prioritize recall in evaluation. This encourages excessive false positives and endangers the safety of AD. To address this issue, we restrict the definition of unknown objects to threatening objects in AD and introduce a new evaluation protocol, built upon a new metric named U-ARecall, to alleviate the biased evaluation caused by neglecting false positives. Under the new evaluation protocol, we re-evaluate existing OWOD methods and find that they typically perform poorly in AD. We then propose a novel OWOD paradigm for AD based on fine-tuning foundational open-vocabulary models (OVMs), which can exploit rich linguistic and visual prior knowledge for OWOD. Following this new paradigm, we propose a brand-new OWOD solution that effectively addresses two core challenges of fine-tuning OVMs via two novel techniques: 1) maintaining open-world generic knowledge with a dual-branch architecture; 2) acquiring scenario-specific knowledge with a visual-oriented contrastive learning scheme. In addition, a dual-branch prediction fusion module is proposed to avoid post-processing and hand-crafted heuristics. Extensive experiments show that our method not only surpasses classic OWOD methods in unknown object detection by a large margin ($\sim$3$\times$ U-ARecall), but also notably outperforms OVMs without fine-tuning in known object detection ($\sim$20\% K-mAP). Our code is available at https://github.com/harrylin-hyl/AD-OWOD.
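The abstract describes the visual-oriented contrastive learning scheme only at a high level. As a rough, hypothetical illustration of the underlying idea (not the paper's actual implementation), the sketch below aligns visual region features with frozen class-text embeddings of an OVM via a standard InfoNCE-style objective; the names `align_loss`, `region_feats`, `text_embeds`, and the temperature value are assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def align_loss(region_feats: torch.Tensor,
               text_embeds: torch.Tensor,
               labels: torch.Tensor,
               temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical sketch: pull each visual region feature toward the (frozen)
    text embedding of its class and push it away from the other classes."""
    v = F.normalize(region_feats, dim=-1)   # (N, D) visual region features
    t = F.normalize(text_embeds, dim=-1)    # (C, D) class-text embeddings
    logits = v @ t.T / temperature          # (N, C) scaled cosine similarities
    # InfoNCE reduces to cross-entropy over classes for each region.
    return F.cross_entropy(logits, labels)

# Usage (shapes only): 8 region features, 5 classes, a 256-d shared space.
loss = align_loss(torch.randn(8, 256), torch.randn(5, 256),
                  torch.randint(0, 5, (8,)))
```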
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Autonomous driving (AD) is a representative artificial intelligence application that requires effectively exploiting multimedia information. This work focuses on exploiting multimedia information, namely vision and language, to identify threatening objects in AD, thereby enhancing driving safety. Specifically, this work involves the interpretation and processing of the language and visual modalities: 1) generating a general vocabulary bag that includes the texts of threatening objects to help the language model comprehend the semantics of “threatening objects”, so as to correctly distinguish threatening objects from false positives (a minimal illustration is sketched below); 2) fine-tuning foundational open-vocabulary models (OVMs) such as GroundingDINO to adapt them to the AD scenario, which allows our method to process and fuse image and text data, comprehend rich high-level language semantics, and recognize diverse visual patterns in the open world; 3) aligning the visual feature space with the trained language semantic space of OVMs by contrastive learning, which aligns the semantic spaces of the visual and textual modalities. In summary, this work is closely related to the subject areas of “Media Interpretation” and “Vision and Language” in ACMMM.
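As a purely illustrative sketch of point 1 above, the snippet below assembles threatening-object category names into a single text prompt of the kind consumed by GroundingDINO-style open-vocabulary detectors; the category list, the `build_caption` helper, and the period-separated prompt format are assumptions made here, not the paper's actual vocabulary bag.

```python
# Hypothetical example: placeholder category names, not the paper's vocabulary.
THREATENING_VOCAB = [
    "car", "truck", "bus", "pedestrian", "cyclist",
    "motorcycle", "animal", "traffic cone", "debris",
]

def build_caption(vocab: list[str]) -> str:
    """Join category names into one period-separated caption string,
    a prompt format commonly used by GroundingDINO-style detectors."""
    return " . ".join(vocab) + " ."

caption = build_caption(THREATENING_VOCAB)
print(caption)  # "car . truck . bus . pedestrian . ..."
```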
Supplementary Material: zip
Submission Number: 113