Revisiting Few-Shot Object Detection with Vision-Language Models

Published: 26 Sept 2024, Last Modified: 13 Nov 2024NeurIPS 2024 Track Datasets and Benchmarks PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Few-Shot Object Detection, Vision-Language Models, Concept Alignment, Multi-Modal
TL;DR: We modernize few-shot object detection to allow foundation models to "enter the conversation".
Abstract: The era of vision-language models (VLMs) trained on web-scale datasets challenges conventional formulations of “open-world" perception. In this work, we revisit the task of few-shot object detection (FSOD) in the context of recent foundational VLMs. First, we point out that zero-shot predictions from VLMs such as GroundingDINO significantly outperform state-of-the-art few-shot detectors (48 vs. 33 AP) on COCO. Despite their strong zero-shot performance, such foundation models may still be sub-optimal. For example, trucks on the web may be defined differently from trucks for a target applications such as autonomous vehicle perception. We argue that the task of few-shot recognition can be reformulated as aligning foundation models to target concepts using a few examples. Interestingly, such examples can be multi-modal, using both text and visual cues, mimicking instructions that are often given to human annotators when defining a target concept of interest. Concretely, we propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data and fine-tuned on multi-modal (text and visual) K-shot examples per target class. We repurpose nuImages for Foundational FSOD, benchmark several popular open-source VLMs, and provide an empirical analysis of state-of-the-art methods. Lastly, we discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 23.3 mAP!
Supplementary Material: pdf
Submission Number: 1078
Loading