Zero-Shot Recognition with Guided Cropping

ICLR 2024 Workshop ME-FoMo Submission 7 Authors

Published: 04 Mar 2024, Last Modified: 29 Apr 2024 · ME-FoMo 2024 Poster · CC BY 4.0
Keywords: zero-shot, open-vocabulary, CLIP, image classification
TL;DR: We improve CLIP zero-shot object recognition by increasing object-relevant and minimizing object-irrelevant information in the image encodings.
Abstract: Pretrained vision-language models such as CLIP show promising zero-shot transfer capability across various unseen classification datasets. However, they have an inherent limitation: CLIP image encoders are typically designed to extract generic image-level features, which can summarize information that is superfluous or confounding for a given target task. This degrades classification performance, especially when objects of interest cover only small areas of the input images. In this work, we propose CLIP with Guided Cropping (GC-CLIP), which uses an off-the-shelf zero-shot object detection model in a preprocessing step to increase the focus of zero-shot classifiers on the object of interest and minimize the influence of extraneous image regions. We empirically show that our approach improves zero-shot performance across architectures and datasets, most notably for small objects.
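As a rough illustration of the guided-cropping idea described in the abstract, the sketch below pairs an off-the-shelf zero-shot detector with CLIP: the detector is queried with the candidate class names, the image is cropped to the highest-scoring box, and CLIP classifies the crop. The choice of OWL-ViT as detector, the Hugging Face checkpoints, the prompt template, and the best-box selection rule are all illustrative assumptions for this sketch, not the authors' exact pipeline.

```python
# Minimal sketch of guided cropping for CLIP zero-shot classification.
# Assumes Hugging Face `transformers` with OWL-ViT as the zero-shot detector;
# the paper's detector, crop-selection rule, and score fusion may differ.
import torch
from PIL import Image
from transformers import (CLIPModel, CLIPProcessor, OwlViTForObjectDetection,
                          OwlViTProcessor)

def classify_with_guided_cropping(image: Image.Image, class_names, device="cpu"):
    # 1) Zero-shot detection: query the detector with the candidate class names.
    det_processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
    detector = OwlViTForObjectDetection.from_pretrained(
        "google/owlvit-base-patch32").to(device)
    queries = [[f"a photo of a {c}" for c in class_names]]
    det_inputs = det_processor(text=queries, images=image,
                               return_tensors="pt").to(device)
    with torch.no_grad():
        det_outputs = detector(**det_inputs)
    target_sizes = torch.tensor([image.size[::-1]]).to(device)  # (H, W)
    results = det_processor.post_process_object_detection(
        det_outputs, threshold=0.1, target_sizes=target_sizes)[0]

    # 2) Guided crop: take the highest-scoring box, falling back to the
    #    full image when the detector finds nothing above threshold.
    if len(results["scores"]) > 0:
        best = results["scores"].argmax()
        x0, y0, x1, y1 = (v.item() for v in results["boxes"][best])
        crop = image.crop((x0, y0, x1, y1))
    else:
        crop = image

    # 3) CLIP zero-shot classification on the cropped region.
    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
    prompts = [f"a photo of a {c}" for c in class_names]
    clip_inputs = clip_processor(text=prompts, images=crop,
                                 return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        probs = clip(**clip_inputs).logits_per_image.softmax(dim=-1)[0]
    return class_names[probs.argmax().item()], probs

# Example usage (hypothetical image path):
#   image = Image.open("dog.jpg").convert("RGB")
#   label, probs = classify_with_guided_cropping(image, ["dog", "cat", "bird"])
```

Reloading both models on every call is kept here only to make the sketch self-contained; in practice one would load them once and reuse them across images.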
Submission Number: 7