Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity

Zhenlin Xu; Yi Zhu; Siqi Deng; Abhay Mittal; Yanbei Chen; Manchen Wang; Paolo Favaro; Joseph Tighe; Davide Modolo

Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity

Zhenlin Xu, Yi Zhu, Siqi Deng, Abhay Mittal, Yanbei Chen, Manchen Wang, Paolo Favaro, Joseph Tighe, Davide Modolo

21 Sept 2023 (modified: 11 Feb 2024)Submitted to ICLR 2024EveryoneRevisionsBibTeX

Primary Area: representation learning for computer vision, audio, language, and other modalities

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: vision-language model, representation learning, benchmark

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: We propose new benchmark that evaluate vision-language models for zeroshot recognition on semantic granularity consistency and specificity robustness. and shows most existing contrastive VLMs struggle with this benchmark.

Abstract: This paper introduces innovative benchmarks to evaluate Vision-Language Models (VLMs) in real-world zero-shot recognition tasks, focusing on the pivotal properties of granularity and specificity. We propose a unique evaluation protocol using adapted ImageNet and MS-COCO datasets to assess models' consistency in recognizing concepts at varying granularity levels and their sensitivity to the specificity of language inputs. Our extensive evaluation reveals that state-of-the-art VLMs, including contrastive models like CLIP, struggle with granularity and are sensitive to text specificity, impacting their effectiveness in open-world settings. This comprehensive study, a first in evaluating VLMs from these perspectives, provides valuable insights and tools for the community, highlighting the limitations and paving the way for enhanced models with better generalization in zero-shot recognition. Our benchmark will be open-sourced upon acceptance.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 4022

Loading