Keywords: Training-free Multi-Label Recognition, Vision-Language Model, Class Concept, Context-Guided Visual Feature
TL;DR: Inspired by the cognitive neuroscience perspective on how humans form concepts of words, we propose a class concept representation that exploits vision-language models for training-free multi-label recognition.
Abstract: The power of large vision-language models (VLMs) has been demonstrated on diverse vision tasks, including multi-label recognition, either with training-free approaches or with prompt tuning, by measuring the cosine similarity between text features related to class names and the visual features of images. Prior works usually form the class-related text features by averaging simple hand-crafted text prompts containing class names (e.g., ``a photo of {class name}''). However, this may not fully exploit the capability of VLMs, considering how humans form concepts of words from rich contexts and patterns of co-occurrence with other words. Inspired by this, we propose a class concept representation for zero-shot multi-label recognition that better exploits the rich contexts in massive image descriptions (e.g., captions from MS-COCO) using large VLMs. Then, to better align the visual features of VLMs with our class concept representation, we propose a context-guided visual representation that lies in the same linear space as the class concept representation. Experimental results on diverse benchmarks show that our proposed methods substantially improve the performance of zero-shot methods such as Zero-Shot CLIP and outperform zero-shot prompt tuning approaches that require additional training, such as TaI-DPT. In addition, our proposed methods work synergistically with existing prompt tuning methods, consistently improving the performance of DualCoOp and TaI-DPT in a training-free manner with a negligible increase in inference time.
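To make the baseline concrete, the following is a minimal sketch of the Zero-Shot CLIP scoring that the abstract describes (hand-crafted prompts averaged per class, cosine similarity against the image feature, per-class thresholding for multi-label prediction). It assumes the OpenAI CLIP package; the class names, prompt templates, image path, and threshold are illustrative placeholders, and it does not implement the proposed class concept or context-guided visual representations.

```python
# Sketch of the Zero-Shot CLIP multi-label baseline referenced in the abstract,
# not the authors' proposed method. Assumes the OpenAI CLIP package (pip install clip).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

templates = ["a photo of a {}.", "a picture of a {}."]   # illustrative hand-crafted prompts
class_names = ["person", "dog", "bicycle"]                # hypothetical label set

with torch.no_grad():
    # Class-related text features: average the features of several prompts per class.
    per_class_feats = []
    for name in class_names:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        feats = model.encode_text(tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        per_class_feats.append(feats.mean(dim=0))
    text_feats = torch.stack(per_class_feats)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    # Visual feature of the query image.
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    image_feat = model.encode_image(image)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    # Cosine similarity per class; threshold instead of softmax,
    # since several labels may be present in one image.
    scores = (image_feat @ text_feats.T).squeeze(0)
    predicted = [c for c, s in zip(class_names, scores.tolist()) if s > 0.25]

print(predicted)
```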
Supplementary Material: zip
Primary Area: Machine vision
Submission Number: 5550