Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

19 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: grounding, vision-language, open-vocabulary localization
Abstract: Vision-language models have shown remarkable performance in various fields, ranging from zero-shot classification to captioning and prompt-based image generation. So far, however, these models have not been able to localize referential expressions and objects in images, with the result that they are either used only as a post-process labeling step or need to be fine-tuned for this task. In this work, we show that vision-language (VL) models trained with image-level objectives already hold object localization properties. We propose the Grounding Everything Model (GEM), which leverages these properties without retraining or fine-tuning the pretrained model. To this end, we extend the idea of v-v attention introduced by CLIPSurgery to a generalized self-self attention path and propose a set of regularizations that allow the model to generalize better across datasets and backbones. We further show how self-self attention corresponds to clustering, enforcing groups of tokens arising from the same object to be similar while preserving the alignment with the language space. We evaluate the proposed GEM framework on three benchmark datasets and improve performance in training-free open-vocabulary localization.
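The abstract's central mechanism is the generalized self-self attention path: instead of computing similarity between queries and keys, the same projection is reused on both sides of the attention, so patch tokens attend to tokens with similar features, which behaves like a soft clustering over an object's tokens. The sketch below is not the authors' code; it is a minimal illustration under the assumption of a standard pre-trained ViT attention block whose q, k, or v nn.Linear projection is passed in as `proj`.

```python
# Minimal sketch of a generalized self-self attention step (illustrative only).
# Assumes x holds the patch tokens of one transformer block and proj is one of
# that block's learned q, k, or v projections, reused on BOTH sides of the
# similarity (e.g. v-v attention), instead of the usual q-k pairing.
import torch
import torch.nn as nn

def self_self_attention(x: torch.Tensor, proj: nn.Linear, num_heads: int = 12) -> torch.Tensor:
    """x: (B, N, D) token embeddings; returns (B, N, D) re-aggregated tokens."""
    B, N, D = x.shape
    head_dim = D // num_heads
    # Project once and use the result as both "query" and "key": tokens from the
    # same object have similar features, so they attend to each other strongly.
    p = proj(x).reshape(B, N, num_heads, head_dim).transpose(1, 2)   # (B, H, N, d)
    attn = (p @ p.transpose(-2, -1)) / head_dim ** 0.5               # self-self similarity
    attn = attn.softmax(dim=-1)
    # Aggregation pulls tokens of the same group toward a shared representation,
    # which is the clustering behaviour described in the abstract.
    out = (attn @ p).transpose(1, 2).reshape(B, N, D)
    return out
```

Because the output stays in the same embedding space as the original tokens, it can still be compared against CLIP-style text embeddings for open-vocabulary localization; the regularizations mentioned in the abstract (not shown here) are what the paper credits for generalization across datasets and backbones.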
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2028