Examining the Achilles' Heel of CLIP Models: The Worst-Performing Categories

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: vision-language models, worst-class performance, CLIP, prompt ensemble, zero-shot recognition
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Contrastive Language-Image Pre-training (CLIP) provides a foundation model that grounds visual concepts in natural language. Although previous studies have demonstrated that satisfactory overall accuracy can be achieved across numerous downstream tasks through well-designed textual prompts, this evaluation mechanism inevitably overlooks certain categories: underperforming categories have only a limited impact on overall performance, even when they are highly important. For example, on ImageNet there are 10 categories whose class-wise accuracy is as low as 0\%, far below the overall accuracy of 64.1\%. This phenomenon reveals the potential risks of deploying CLIP models, especially in risk-sensitive applications. To address this issue, we investigate the alignment between the two modalities in the CLIP model and propose the Class-wise Matching Margin (\cmm) to measure inference confusion. \cmm\ can effectively identify the worst-performing categories and estimate the potential performance of candidate prompts. We further query large language models to enrich the descriptions of the worst-performing categories and build a weighted ensemble that highlights the most effective prompts. Experimental results clearly verify the effectiveness of our proposal: accuracy on the worst-10 categories of ImageNet is boosted to 5.2\%, without manual prompt engineering, laborious optimization, or access to labeled validation data.
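Below is a minimal, hypothetical sketch of how a \cmm-style score might be computed from CLIP embeddings. It is not the paper's definition: the per-class aggregation, the top-1 versus runner-up margin, and the function and variable names are assumptions made for illustration only.

```python
# Hypothetical sketch of a class-wise matching margin (CMM)-style score.
# Assumption: the margin for a class is the average gap between the top-1 and
# runner-up image-text similarities over images zero-shot-assigned to that class.
import numpy as np

def class_wise_matching_margin(image_emb, text_emb):
    """image_emb: (N, D) L2-normalized image features from a CLIP image encoder.
    text_emb:  (C, D) L2-normalized class-prompt features from a CLIP text encoder.
    Returns a length-C array of per-class margins (smaller = more confusion)."""
    sims = image_emb @ text_emb.T                  # (N, C) cosine similarities
    pred = sims.argmax(axis=1)                     # zero-shot predicted class per image
    top2 = np.partition(sims, -2, axis=1)[:, -2:]  # two largest similarities per image
    margin = top2[:, 1] - top2[:, 0]               # top-1 minus runner-up similarity
    num_classes = text_emb.shape[0]
    cmm = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = pred == c
        if mask.any():
            cmm[c] = margin[mask].mean()           # average margin of images assigned to c
    return cmm

# Toy usage with random stand-ins for CLIP features (illustration only).
rng = np.random.default_rng(0)
img = rng.normal(size=(1000, 512)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(100, 512)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
scores = class_wise_matching_margin(img, txt)
worst10 = np.argsort(scores)[:10]                  # candidate worst-performing categories
```

In this sketch, classes with the smallest average margins would be flagged as likely worst performers, which could then be targeted with enriched prompts as the abstract describes; the same per-class score could also be used to compare candidate prompt sets without labeled validation data.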
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4499