Keywords: zero-shot learning, vision-language
TL;DR: A new framework for zero-shot classification with vision-language models
Abstract: Vision-language models like CLIP have excelled in zero-shot inference by training on vast image-text datasets. However, relying solely on category names during inference limits their performance. Prior work introduced category descriptions generated by large language models (LLMs) to enhance recognition and interpretability, but these descriptions struggle to capture distinctions between fine-grained classes. We introduce Pairwise Attribute Contrasting (PAC), a zero-shot inference framework for vision-language models. PAC prompts LLMs to provide specific visual attributes that distinguish each pair of categories.
To aggregate the pairwise comparisons into a single classification, PAC uses a voting procedure. Specifically, for each test image, every pairwise classifier is applied using its own pair-specific attributes to compute image-text similarities. Within each pair, the category with the higher image-text similarity receives a vote. Finally, the category that receives the most votes becomes the final prediction.
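The voting procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `pac_vote`, the data layout of `pair_text_embs`, the use of a mean over attribute similarities as the per-category score, and the tie handling are all assumptions made for the example, and the two-dimensional vectors stand in for real image and text embeddings.

```python
from itertools import combinations


def dot(u, v):
    """Dot product of two equal-length vectors (a stand-in for cosine
    similarity, assuming the embeddings are already normalized)."""
    return sum(a * b for a, b in zip(u, v))


def pac_vote(image_emb, pair_text_embs, classes):
    """Hypothetical sketch of pairwise voting.

    pair_text_embs[(a, b)] maps each of the two categories to a list of
    text embeddings for the attributes that distinguish it from the other.
    Assumption: a category's score within a pair is the mean similarity
    over its attributes; the source does not specify the aggregation.
    """
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        attrs = pair_text_embs[(a, b)]
        score_a = sum(dot(image_emb, t) for t in attrs[a]) / len(attrs[a])
        score_b = sum(dot(image_emb, t) for t in attrs[b]) / len(attrs[b])
        # A category gets a vote only if its similarity is strictly higher.
        if score_a > score_b:
            votes[a] += 1
        elif score_b > score_a:
            votes[b] += 1
    # Assumption: ties in the final count go to the earlier class in `classes`.
    prediction = max(classes, key=lambda c: votes[c])
    return prediction, votes


# Toy data (illustrative only).
classes = ["cat", "dog", "fox"]
image_emb = [1.0, 0.0]
pair_text_embs = {
    ("cat", "dog"): {"cat": [[0.9, 0.1]], "dog": [[0.2, 0.8]]},
    ("cat", "fox"): {"cat": [[0.8, 0.2]], "fox": [[0.6, 0.4]]},
    ("dog", "fox"): {"dog": [[0.1, 0.9]], "fox": [[0.7, 0.3]]},
}
prediction, votes = pac_vote(image_emb, pair_text_embs, classes)
print(prediction, votes)
```

With K categories this loop evaluates K(K-1)/2 pairwise classifiers per image.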
PAC shows consistent improvement over strong baselines on 18 benchmark datasets and across various model architectures. We further provide an efficient implementation that computes text embeddings only for the unique attributes of each category, which significantly reduces the computational cost compared to naively computing text embeddings for all attributes.
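One way to read the efficiency idea is as deduplication: a category appears in many pairs, and the same attribute may be listed for it more than once, so each distinct (category, attribute) text needs to be encoded only once. The sketch below illustrates that reading under stated assumptions. The helper `build_attribute_cache`, the prompt template, the toy attribute lists, and `toy_embed_text` (a placeholder for a real text encoder) are all hypothetical and not taken from the paper.

```python
def build_attribute_cache(pair_attributes, embed_text):
    """Embed each distinct (category, attribute) text exactly once.

    pair_attributes[(a, b)] maps each category in the pair to its list of
    attribute strings. `embed_text` is any callable from str to a vector.
    """
    cache = {}
    for per_class in pair_attributes.values():
        for category, attrs in per_class.items():
            for attr in attrs:
                key = (category, attr)
                if key not in cache:
                    # Assumed prompt template, for illustration only.
                    cache[key] = embed_text(f"{category}, which has {attr}")
    return cache


# Placeholder encoder that records how often it is called.
calls = []


def toy_embed_text(prompt):
    calls.append(prompt)
    return [float(len(prompt)), 1.0]


# Toy attribute lists (illustrative, not LLM output from the paper).
pair_attributes = {
    ("cat", "dog"): {"cat": ["short muzzle", "retractable claws"],
                     "dog": ["long snout"]},
    ("cat", "fox"): {"cat": ["short muzzle"], "fox": ["bushy tail"]},
    ("dog", "fox"): {"dog": ["long snout"], "fox": ["bushy tail"]},
}

cache = build_attribute_cache(pair_attributes, toy_embed_text)
total_mentions = sum(len(a) for p in pair_attributes.values() for a in p.values())
print(f"{total_mentions} attribute mentions, {len(calls)} encoder calls")
```

At inference time, each pairwise classifier would look up its attributes in `cache` instead of calling the encoder again.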
Primary Area: general machine learning (i.e., none of the above)
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Submission Number: 998