Revisiting Few-Shot Object Detection using Vision-Language Models

16 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: Object Detection, Vision-Language Models, Few-Shot Object Detection
Abstract: Few-shot object detection (FSOD) benchmarks have advanced techniques for detecting new categories with limited annotations. Existing FSOD benchmarks repurpose well-established datasets like COCO by partitioning categories into base and novel classes for pre-training and fine-tuning, respectively. However, these benchmarks do not reflect how FSOD is deployed in practice. Rather than pre-training on only a small number of base categories, we argue that it is more practical to download a foundation model (e.g., a vision-language model (VLM) pre-trained on web-scale data) and fine-tune it for specific applications. Surprisingly, we find that zero-shot inference from foundation VLMs like GroundingDINO significantly outperforms the state-of-the-art (48.3 vs. 33.1 AP) on COCO, suggesting that few-shot detection should be reframed in the context of foundation models. In this work, we propose a new FSOD benchmark protocol that evaluates detectors pre-trained on any external dataset (excluding the target dataset) and fine-tuned on K-shot annotations for each of C target classes. Further, we note that FSOD benchmarks are actually federated datasets, in which each category is exhaustively annotated only on a subset of the images. We leverage this insight and propose simple strategies for fine-tuning VLMs to improve FSOD. We demonstrate the effectiveness of our approach on LVIS and nuImages.
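To make the proposed protocol concrete, below is a minimal Python sketch of the sampling step it describes: for each of the C target classes, at most K annotated examples are drawn to form the fine-tuning split. This is an illustration under assumed conventions, not code from the submission; the function sample_k_shot_split and the "image_id"/"category" record fields are hypothetical stand-ins for a COCO/LVIS-style annotation file.

    import random
    from collections import defaultdict

    def sample_k_shot_split(annotations, k, seed=0):
        # Group COCO/LVIS-style annotation records by category.
        # Each record is assumed to be a dict with "image_id" and
        # "category" keys (hypothetical field names).
        rng = random.Random(seed)
        by_category = defaultdict(list)
        for ann in annotations:
            by_category[ann["category"]].append(ann)

        # Keep at most k annotations per category; the union over all
        # C target classes is the few-shot fine-tuning set.
        few_shot = []
        for anns in by_category.values():
            rng.shuffle(anns)
            few_shot.extend(anns[:k])
        return few_shot

    # Example: a 2-shot split over three target classes.
    toy_annotations = [
        {"image_id": i, "category": c}
        for i, c in enumerate(["car", "car", "car", "person", "person", "bicycle"])
    ]
    print(sample_k_shot_split(toy_annotations, k=2))

Note that under the federated-dataset view in the abstract, an image sampled this way is only a reliable source of negatives for the categories it was exhaustively annotated for, which is what motivates the paper's per-category fine-tuning strategies.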
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 766