I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification

Published: 2023, Last Modified: 27 Jan 2024, Venue: CVPR 2023, Readers: Everyone
Abstract: Recent works have shown that unstructured text (documents) from online sources can serve as useful auxiliary information for zero-shot image classification. However, these methods require access to a high-quality source like Wikipedia and are limited to a single source of information. Large Language Models (LLMs) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks. In this work, we provide a novel perspective on using an LLM to provide text supervision for a zero-shot image classification model. The LLM is provided with a few text descriptions from different annotators as examples. The LLM is conditioned on these examples to generate multiple text descriptions for each class (referred to as views). Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views. We show that each text view of a class provides complementary information, allowing a model to learn a highly discriminative class embedding. Moreover, we show that I2MVFormer is better at consuming the multi-view text supervision from the LLM compared to baseline models. I2MVFormer establishes a new state-of-the-art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings. Code is available at https://github.com/ferjad/I2DFormer
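As a rough illustration of the few-shot conditioning step described in the abstract, the sketch below builds one prompt per annotator, so that each prompt would elicit a different view of the target class. It is not the authors' implementation: the function names, the prompt template, and the example descriptions are all hypothetical, and the call to an actual LLM is left out.

```python
from typing import Dict, List


def build_view_prompt(class_name: str, examples: Dict[str, str]) -> str:
    """Build a few-shot prompt asking for one description (view) of class_name.

    `examples` maps an example class name to a description written by a
    single annotator. The template is an assumption, not taken from the paper.
    """
    parts: List[str] = []
    for name, description in examples.items():
        parts.append(f"Class: {name}\nDescription: {description}\n")
    # Leave the final description open for the LLM to complete.
    parts.append(f"Class: {class_name}\nDescription:")
    return "\n".join(parts)


def build_multi_view_prompts(
    class_name: str, annotator_examples: List[Dict[str, str]]
) -> List[str]:
    """Return one prompt per annotator; each prompt yields one view."""
    return [build_view_prompt(class_name, ex) for ex in annotator_examples]


if __name__ == "__main__":
    # Made-up example descriptions from two hypothetical annotators.
    annotators = [
        {"horse": "A large four-legged animal with a long face."},
        {"horse": "Has a mane, hooves, and a flowing tail."},
    ]
    for prompt in build_multi_view_prompts("zebra", annotators):
        print(prompt)
        print("---")
```

Each returned prompt would be sent to the LLM separately, and the completions collected as the views for that class.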