Keywords: ChatGPT, Hierarchical Comparisons, Image Classification, Zero-shot
TL;DR: A training-free, explainable, and effective zero-shot image classification method with enriched hierarchical descriptions powered by LLMs.
Abstract: The zero-shot open-vocabulary setting poses challenges for image classification.
Fortunately, utilizing a vision-language model like CLIP, pre-trained on image-text
pairs, allows for classifying images by comparing embeddings. Leveraging large
language models (LLMs) such as ChatGPT can further enhance CLIP’s accuracy
by incorporating class-specific knowledge in descriptions. However, CLIP still
exhibits a bias towards certain classes, and LLMs tend to generate similar descriptions
for similar classes, disregarding their differences. To address these problems, we present a
novel image classification framework via hierarchical comparisons. By recursively
comparing and grouping classes with LLMs, we construct a class hierarchy. With
such a hierarchy, we can classify an image by descending from the top to the bottom
of the hierarchy, comparing image and text embeddings at each level. Through
extensive experiments and analyses, we demonstrate that our proposed approach is
intuitive, effective, and explainable. Code will be released upon publication.
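
The top-down classification described in the abstract can be illustrated with a minimal sketch. It assumes a greedy descent that, at each level, picks the child whose text embedding has the highest cosine similarity to the image embedding; the paper's actual scoring rule may differ. The names `Node`, `cosine`, and `classify` are hypothetical, and the 3-D vectors are made-up stand-ins for CLIP embeddings rather than real model outputs.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One group or class in the hierarchy (hypothetical structure)."""
    name: str
    text_embedding: list  # embedding of the node's textual description
    children: list = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(image_embedding, root):
    """Descend from the root to a leaf, choosing the best-matching child
    at each level. Returns the list of node names visited, which doubles
    as an explanation of the decision."""
    node = root
    path = []
    while node.children:
        node = max(
            node.children,
            key=lambda child: cosine(image_embedding, child.text_embedding),
        )
        path.append(node.name)
    return path


# Toy hierarchy with made-up embeddings (assumption: not real CLIP vectors).
root = Node("root", [0.0, 0.0, 0.0], [
    Node("dogs", [1.0, 0.0, 0.0], [
        Node("husky", [1.0, 0.0, 0.5]),
        Node("beagle", [1.0, 0.0, -0.5]),
    ]),
    Node("cats", [0.0, 1.0, 0.0], [
        Node("tabby", [0.0, 1.0, 0.5]),
        Node("siamese", [0.0, 1.0, -0.5]),
    ]),
])

image = [0.9, 0.1, 0.4]
print(classify(image, root))
```

In this sketch the returned path records which group was chosen at each level, which is one way the hierarchy could make a prediction explainable. In practice the embeddings would come from a vision-language model's image and text encoders, and the hierarchy from LLM-generated groupings.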
Supplementary Material: zip
Submission Number: 10004