Open-Vocabulary Multi-label Image Classification with Pretrained Vision-Language Model

Published: 01 Jan 2023, Last Modified: 02 Nov 2023 · ICME 2023
Abstract: We design an open-vocabulary multi-label image classification model that predicts multiple novel concepts in an image based on a powerful language-image pretrained model, i.e., CLIP. While CLIP achieves remarkable performance on single-label zero-shot image classification, it only utilizes a global image feature, which is less suitable for predicting multiple labels. To address this problem, we propose a novel method containing an Image-Text attention module that extracts multiple class-specific image features from CLIP. In addition, we introduce a new training method with a contrastive loss that helps the attention module find diverse attention masks across classes. During testing, the class-specific features are interpolated with the CLIP features to boost performance. Extensive experiments show that our proposed method achieves state-of-the-art performance on zero-shot multi-label image classification on two benchmark datasets.
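The sketch below is a minimal illustration (not the authors' released code) of the general idea described in the abstract: class (text) embeddings attend over CLIP patch features to produce class-specific image features, and at test time the resulting per-class scores are interpolated with the global CLIP scores. All shapes, function names, and the mixing weight `alpha` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def class_specific_scores(patch_feats, text_embeds, global_feat, alpha=0.5, tau=0.07):
    """
    patch_feats: (N, D)  CLIP patch/region features for one image (assumed precomputed)
    text_embeds: (C, D)  CLIP text embeddings, one per class
    global_feat: (D,)    CLIP global image feature
    Returns per-class logits of shape (C,).
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    global_feat = F.normalize(global_feat, dim=-1)

    # Image-text attention: each class embedding attends over the patches,
    # yielding one class-specific image feature per class.
    attn = torch.softmax(text_embeds @ patch_feats.t() / tau, dim=-1)   # (C, N)
    class_feats = F.normalize(attn @ patch_feats, dim=-1)               # (C, D)

    # Class-specific (local) and global similarity scores.
    local_scores = (class_feats * text_embeds).sum(dim=-1)              # (C,)
    global_scores = text_embeds @ global_feat                           # (C,)

    # Test-time interpolation of the two score sets (alpha is a guess).
    return alpha * local_scores + (1 - alpha) * global_scores


if __name__ == "__main__":
    N, D, C = 49, 512, 20  # toy sizes: 49 patches, 512-dim features, 20 classes
    scores = class_specific_scores(torch.randn(N, D), torch.randn(C, D), torch.randn(D))
    print(scores.shape)    # torch.Size([20])
```

In practice, the attention module and training objective (the contrastive loss encouraging diverse attention masks) would be learned; this snippet only shows how class-specific features can be pooled from patch features and combined with the global score.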