Abstract: Some recent methods address few-shot image classification by extracting semantic information from class names and devising mechanisms for aligning vision and semantics to integrate information from both modalities. However, class names provide only abstract information, which is insufficient to capture the visual details present in images. As a result, this vision-semantics alignment is inherently biased, leading to sub-optimal integration outcomes. In this paper, we avoid this biased vision-semantics alignment by introducing CLIP, a natural bridge between vision and semantics, and enforcing unbiased vision-vision alignment as a proxy task. Specifically, we align features encoded from the same image by both the few-shot encoder and CLIP's vision encoder. This alignment is accomplished through a linear layer, with a training objective formulated using optimal transport-based assignment prediction. Thanks to the inherent alignment between CLIP's vision and text encoders, the few-shot encoder is indirectly aligned to CLIP's text encoder, which serves as the foundation for better vision-semantics integration. In addition, to further improve vision-semantics integration at the testing stage, we mine potential fine-grained semantic attributes of class names from large language models. Correspondingly, an online optimization module is designed to adaptively integrate the semantic attributes and visual information extracted from images. Extensive results on four datasets demonstrate that our method outperforms state-of-the-art methods.
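Below is a minimal sketch (PyTorch) of the vision-vision alignment step described in the abstract: features of the same image from the few-shot encoder and from a frozen CLIP vision encoder are aligned through a linear layer, with an optimal transport (Sinkhorn) based assignment-prediction objective. The encoder names, feature dimensions, and the prototype head are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """Sinkhorn-Knopp normalization turning similarity scores into soft assignments."""
    Q = torch.exp(scores / eps).t()              # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K  # normalize over prototypes
        Q /= Q.sum(dim=0, keepdim=True); Q /= B  # normalize over samples
    return (Q * B).t()                           # (B, K), each row sums to ~1

def alignment_loss(few_shot_feats, clip_feats, proj, prototypes, temp=0.1):
    """few_shot_feats: (B, d_f) from the trainable few-shot encoder;
    clip_feats: (B, d_c) from the frozen CLIP vision encoder;
    proj: linear layer mapping d_f -> d_c; prototypes: (K, d_c), learnable."""
    z = F.normalize(proj(few_shot_feats), dim=-1)   # projected few-shot features
    c = F.normalize(clip_feats, dim=-1)             # CLIP vision features
    p = F.normalize(prototypes, dim=-1)
    with torch.no_grad():
        targets = sinkhorn(c @ p.t())               # OT-based soft assignments (codes)
    logits = (z @ p.t()) / temp                     # predict the codes from the few-shot side
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Usage sketch (dimensions are assumed for illustration):
# proj = torch.nn.Linear(640, 512)                  # few-shot dim -> CLIP dim
# prototypes = torch.nn.Parameter(torch.randn(1000, 512))
# loss = alignment_loss(f_backbone(images), clip_visual(images), proj, prototypes)
```

Because CLIP's vision and text encoders are already aligned, pulling the projected few-shot features toward CLIP's vision space in this way indirectly aligns them with CLIP's text space, which is the basis for the later vision-semantics integration.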
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: In this work, we leverage textual information about unseen classes to improve the efficacy of few-shot image classification. We primarily tackle the following critical issues:
(1) Achieving unbiased alignment between visual and semantic modalities.
(2) Mining potential semantic attributes of classes using large language models.
(3) Adaptively integrating information from the visual and semantic modalities (see the sketch after this list).
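The following is a hedged sketch of the test-time online integration idea in item (3): per-episode fusion weights between visual prototypes (from support images) and attribute text embeddings (CLIP-encoded phrases mined from a large language model) are tuned on the labelled support set before classifying queries. The shapes, step counts, and the simple convex fusion are assumptions about the mechanism, not the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def online_integration(support_feats, support_labels, query_feats, attr_text_feats,
                       n_way, steps=50, lr=1e-2, temp=0.1):
    """support_feats: (N_s, d); query_feats: (N_q, d); attr_text_feats: (n_way, d)."""
    with torch.no_grad():
        # Visual prototypes: mean of normalized support features per class.
        s = F.normalize(support_feats, dim=-1)
        protos = torch.stack([s[support_labels == c].mean(0) for c in range(n_way)])
        attrs = F.normalize(attr_text_feats, dim=-1)

    # Per-class mixing coefficient, optimized online on the support set.
    alpha = torch.zeros(n_way, 1, requires_grad=True)
    opt = torch.optim.Adam([alpha], lr=lr)
    for _ in range(steps):
        w = torch.sigmoid(alpha)
        classifiers = F.normalize(w * protos + (1 - w) * attrs, dim=-1)
        logits = (s @ classifiers.t()) / temp
        loss = F.cross_entropy(logits, support_labels)
        opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():
        w = torch.sigmoid(alpha)
        classifiers = F.normalize(w * protos + (1 - w) * attrs, dim=-1)
        q = F.normalize(query_feats, dim=-1)
        return (q @ classifiers.t()) / temp   # query logits over the n_way classes
```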
Supplementary Material: zip
Submission Number: 3931