Abstract: The power of large vision-language models (VLMs) has been demonstrated on downstream vision tasks, including multi-label recognition (MLR), either in a training-free manner or via prompt tuning, by measuring the cosine similarity between text features derived from class names and the visual features of images. Previous studies typically represent class-related text features by averaging simple handcrafted prompts containing class names (e.g., “a photo of a [class name]”), which lack the contextual information required to handle diverse image content and to capture the co-occurrence of multiple semantic classes within an image. In addition, prompt tuning inherently requires labeled data or additional training, making it susceptible to overfitting of the context tokens and hindering generalization. To address these limitations, we propose a training-free and label-free MLR framework that leverages abundant text descriptions of novel classes to narrow the gap between visual and text features at inference time. Inspired by how humans form concepts of words from their contexts and their patterns of co-occurrence with other words, we propose a class concept representation for zero-shot MLR with large VLMs, which exploits the rich contextual information embedded in large-scale image descriptions (e.g., “A person holding a large pair of scissors”). To further align the visual features of VLMs with the class concept representation, we also present a context-guided visual representation based on an attention process that operates in the same linear space as the class concept representation. Experimental results on diverse benchmarks demonstrate that the proposed methods substantially improve the performance of existing zero-shot MLR methods, achieving an average improvement of 9.4% in mean average precision (mAP) over zero-shot contrastive language-image pretraining (CLIP) and 2.2% in mAP over the TaI-DPT zero-shot prompt-tuning method. Moreover, our method works synergistically with existing prompt-tuning methods, consistently improving the performance of DualCoOp and TaI-DPT in a training-free manner with a negligible increase in inference time.
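For concreteness, the sketch below illustrates the zero-shot CLIP baseline that the abstract describes: per-class scores obtained as cosine similarities between the image embedding and text embeddings of handcrafted prompts. It is a minimal sketch using OpenAI's `clip` package; the label set, image path, and prompt template are illustrative placeholders, and the paper's class concept representation and context-guided visual representation are not shown here.

```python
# Minimal sketch of the zero-shot CLIP baseline for multi-label recognition:
# class scores are cosine similarities between the image embedding and the
# text embeddings of handcrafted prompts (placeholder names and paths).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["person", "scissors", "dog"]  # placeholder label set
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)

# L2-normalize so the dot product equals cosine similarity.
image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# One independent score per class; unlike single-label classification,
# no softmax is applied across classes, since multiple labels may co-occur.
scores = (image_feat @ text_feat.T).squeeze(0)
for name, score in zip(class_names, scores.tolist()):
    print(f"{name}: {score:.3f}")
```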