Keywords: object detection, vlm, robustness, noise, open vocab, zero-shot
TL;DR: Removing the bells and whistles of Open-Vocabulary Object Detectors to narrow down the factors affecting robustness to noise.
Abstract: Studying the impact of real-world noise on Open-Vocabulary Object Detectors (OV-ODs) is constrained by their architectural complexity and the scarcity of noise-annotated datasets. Our empirical analysis, Robust Onion, uses controlled synthetic visual degradations to mirror the feature collapse caused by real-world noise and systematically peels apart OV-OD components to assess their robustness. Our findings include: similar vision backbones show comparable robustness, driven by identical feature collapse at similar layers; pretraining, architectural nuances, and captions contribute little to robustness; and robustness depends strongly on the image domain rather than on annotations, which explains the similar impact of COCO and LVIS on robustness (same images, different annotations) and how datasets like ODinW-13, with large, isolated objects, can give a misleading impression of high robustness. These insights point to potential research on cross-layer feature exchange and continual learning strategies for improving robustness efficiently. Our findings highlight critical directions for designing robust OV-ODs under challenging visual degradations.
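The abstract refers to probing where features collapse under controlled synthetic degradations. As a minimal illustrative sketch only (not the paper's code), the snippet below compares clean and noise-corrupted inputs layer by layer in an off-the-shelf backbone; the choice of ResNet-50, additive Gaussian noise, and the probed layers are assumptions made for illustration.

```python
# Sketch: layer-wise feature collapse under a controlled synthetic degradation.
# Backbone, noise model, and probed layers are illustrative assumptions.
import torch
import torchvision.models as models
from torchvision.models.feature_extraction import create_feature_extractor

# Backbone whose intermediate features are compared between clean and noisy inputs.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
extractor = create_feature_extractor(
    backbone, return_nodes={"layer1": "l1", "layer2": "l2", "layer3": "l3", "layer4": "l4"}
)

def add_gaussian_noise(images: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Controlled synthetic degradation: additive Gaussian noise, clipped to [0, 1]."""
    return (images + sigma * torch.randn_like(images)).clamp(0.0, 1.0)

@torch.no_grad()
def layerwise_collapse(images: torch.Tensor, sigma: float = 0.1) -> dict:
    """Mean cosine similarity between clean and degraded features per layer.
    Low similarity at a given layer indicates where features start to collapse."""
    clean = extractor(images)
    noisy = extractor(add_gaussian_noise(images, sigma))
    return {
        name: torch.nn.functional.cosine_similarity(
            clean[name].flatten(1), noisy[name].flatten(1), dim=1
        ).mean().item()
        for name in clean
    }

# Example: a random batch stands in for real images.
print(layerwise_collapse(torch.rand(2, 3, 224, 224), sigma=0.2))
```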
Primary Area: interpretability and explainable AI
Submission Number: 10045