GIST: Generating Image-Specific Text for Fine-grained Object Representations

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: representation learning, contrastive learning, fine-grained image classification, few-shot learning, large language model, multi-modal models, vision-language
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Recent models pre-trained on image-text pairs can learn rich vision-language representations that improve downstream tasks, such as image classification. However, many domains lack paired image-text descriptions, making it difficult to fine-tune these models for downstream tasks in those domains. In this work, we propose GIST -- a method for generating $\textit{image-specific}$ $\textit{fine-grained}$ text descriptions from image-only datasets. Our key findings are: 1) prompting a pretrained large language model with $\textit{domain-specific}$ prompts generates diverse fine-grained text descriptions that capture the full range of inter-class and intra-class differences, 2) using a pretrained vision-language model to match each training image to the most relevant text descriptions creates image-specific image-text pairs, and 3) summarizing the matched text with a large language model prior to fine-tuning the image encoder improves the utility of the learned representations. We demonstrate the utility of GIST by fine-tuning vision-language models on its output to learn an aligned vision-language representation space. We evaluate this learned representation space in full-shot and few-shot scenarios across four diverse fine-grained classification datasets, each from a different domain. Our method achieves an average improvement in accuracy of 1.1% over the existing state-of-the-art image-text classification method and of 4.1% over CLIP linear probes on full-shot datasets, with similar improvements across few-shot regimes. Code will be made publicly available upon publication.
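To make the matching step of the pipeline concrete, below is a minimal sketch of step 2 (pairing each training image with its most relevant generated descriptions) using a pretrained CLIP model. The Hugging Face `transformers` CLIP checkpoint, the candidate descriptions, the top-k value, and the helper name `top_k_descriptions` are illustrative assumptions, not the paper's exact prompts, models, or hyperparameters.

```python
# Sketch of GIST step 2: match each image to its top-k candidate descriptions
# via CLIP similarity. Descriptions and k are placeholders (assumptions).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate fine-grained descriptions, e.g. produced by prompting an LLM
# with domain-specific prompts (step 1 of GIST); these examples are made up.
candidate_texts = [
    "a bird with a bright red crest and a short conical beak",
    "a bird with mottled brown wings and a long curved bill",
    "a bird with iridescent blue-green plumage and a forked tail",
]

@torch.no_grad()
def top_k_descriptions(image: Image.Image, texts: list[str], k: int = 2) -> list[str]:
    """Return the k descriptions whose CLIP text embeddings best match the image."""
    image_inputs = processor(images=image, return_tensors="pt").to(device)
    text_inputs = processor(text=texts, return_tensors="pt", padding=True).to(device)

    image_emb = model.get_image_features(**image_inputs)
    text_emb = model.get_text_features(**text_inputs)

    # Cosine similarity between the single image and every candidate description.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sims = (image_emb @ text_emb.T).squeeze(0)

    top_idx = sims.topk(min(k, len(texts))).indices.tolist()
    return [texts[i] for i in top_idx]

# Example usage: build an image-specific image-text pair for later fine-tuning.
# image = Image.open("example_bird.jpg")
# matched = top_k_descriptions(image, candidate_texts, k=2)
```

In the full method, the matched descriptions would then be summarized by a large language model (step 3) before fine-tuning the vision-language model on the resulting image-text pairs; that stage is not shown here.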
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6168