everyone">EveryoneRevisionsBibTeXCC BY-NC-SA 4.0
The power of large vision-language models (VLMs) has been demonstrated on diverse vision tasks, including multi-label recognition with training-free approaches or prompt tuning, by measuring the cosine similarity between text features related to class names and the visual features of images. Prior works usually formed the class-related text features by averaging simple hand-crafted text prompts containing class names (e.g., ``a photo of {class name}''). However, they may not fully exploit the capability of VLMs, considering how humans form concepts of words from rich contexts and patterns of co-occurrence with other words. Inspired by this, we propose a class concept representation for zero-shot multi-label recognition that better exploits the rich contexts in massive image descriptions (e.g., captions from MS-COCO) using large VLMs. To better align the visual features of VLMs with our class concept representation, we further propose a context-guided visual representation that lies in the same linear space as the class concept representation. Experimental results on diverse benchmarks show that our proposed methods substantially improve the performance of zero-shot methods such as Zero-Shot CLIP, and outperform zero-shot prompt tuning methods that require additional training, such as TaI-DPT. In addition, our proposed methods work synergistically with existing prompt tuning methods, consistently improving the performance of DualCoOp and TaI-DPT in a training-free manner with a negligible increase in inference time.
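For context, below is a minimal sketch of the Zero-Shot CLIP multi-label baseline the abstract refers to: per-class text features are formed by averaging hand-crafted prompts, and each class is scored independently by cosine similarity with the image feature. The class names, prompt templates, image path, and threshold are illustrative assumptions; this is the standard baseline, not the proposed class concept representation.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP
from PIL import Image

# Illustrative class names and hand-crafted prompt templates (assumptions).
CLASS_NAMES = ["person", "dog", "bicycle"]
TEMPLATES = ["a photo of a {}.", "a picture of a {}."]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    # Build each class's text feature by averaging its prompt embeddings.
    class_feats = []
    for name in CLASS_NAMES:
        tokens = clip.tokenize([t.format(name) for t in TEMPLATES]).to(device)
        feats = model.encode_text(tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        mean_feat = feats.mean(dim=0)
        class_feats.append(mean_feat / mean_feat.norm())
    class_feats = torch.stack(class_feats)  # (num_classes, dim)

    # Encode one image and compute cosine similarity to every class.
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical path
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    scores = (img_feat @ class_feats.T).squeeze(0)

# Multi-label prediction: a class is present if its similarity exceeds a
# chosen threshold (0.2 here is an arbitrary illustrative value).
predicted = [n for n, s in zip(CLASS_NAMES, scores.tolist()) if s > 0.2]
print(predicted)
```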