RegionSpot: Unleashing the Power of Frozen Foundation Models for Open-World Region Understanding

20 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Open-world Region Understanding
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Understanding the semantics of individual regions or patches within unconstrained images, such as in open-world object detection, represents a critical yet challenging task in computer vision. Building on the success of powerful image-level vision-language (ViL) foundation models like CLIP, recent efforts have sought to harness their capabilities by either training a contrastive model from scratch on an extensive collection of region-label pairs or aligning the outputs of a detection model with image-level representations of region proposals. Despite notable progress, these approaches are plagued by computationally intensive training requirements, susceptibility to data noise, and a deficiency in contextual information. To address these limitations, we explore the synergistic potential of off-the-shelf foundation models, leveraging their respective strengths in localization and semantics. We introduce a novel, generic, and efficient region recognition architecture, named RegionSpot, designed to integrate position-aware localization knowledge from a localization foundation model (e.g., SAM) with multimodal information extracted from a ViL model (e.g., CLIP). To fully exploit pretrained knowledge while minimizing training overhead, we keep both foundation models frozen, focusing optimization efforts solely on a lightweight attention-based knowledge integration module. Through extensive experiments in the context of open-world object recognition, RegionSpot demonstrates significant performance improvements over prior alternatives, while also providing substantial computational savings; for instance, our model can be trained with 3 million data samples in a single day using 8 V100 GPUs. Our model outperforms GLIP by 6.5% in mean average precision (mAP), with an even larger margin of 14.8% on the more challenging and rare categories. Our source code will be made publicly available.
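The frozen-backbone design described above can be sketched conceptually as follows. This is a minimal illustration, not the paper's implementation: the module name, tensor shapes, projection dimension, and the use of placeholder tensors in place of actual SAM region tokens and CLIP image tokens are all assumptions for exposition.

```python
# Minimal PyTorch sketch: frozen localization and ViL encoders supply region
# tokens and image-level features; only a lightweight cross-attention
# "knowledge integration" module is trained. All names and dimensions here
# are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class KnowledgeIntegration(nn.Module):
    """Trainable cross-attention block fusing region tokens (queries)
    with image-level ViL tokens (keys/values)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, region_tokens: torch.Tensor, vil_tokens: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(region_tokens, vil_tokens, vil_tokens)
        return self.norm(region_tokens + fused)


# Placeholder tensors stand in for the outputs of the frozen foundation
# models (e.g., SAM region tokens and CLIP image tokens), assumed to be
# projected to a shared dimension beforehand.
batch, num_regions, num_vil_tokens, dim = 2, 5, 196, 256
region_tokens = torch.randn(batch, num_regions, dim)   # from frozen localization model
vil_tokens = torch.randn(batch, num_vil_tokens, dim)   # from frozen ViL model

integrator = KnowledgeIntegration(dim)                  # only these weights are optimized
fused = integrator(region_tokens, vil_tokens)           # (batch, num_regions, dim)

# Region semantics could then be read out by matching the fused tokens
# against text embeddings of category names (hypothetical readout, not shown).
print(fused.shape)
```

Because the two foundation models stay frozen, only the small integration module above would receive gradients, which is consistent with the low training cost reported in the abstract.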
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2923