Abstract: Few-shot fine-grained image classification aims to recognize subtle sub-classes within the same parent class using only a few labelled samples. The task is extremely challenging because large inter-class similarity, low intra-class similarity, and the scarcity of labelled samples occur together. To address these challenges, we propose a new Channel-Spatial Cross-Attention Module (CSCAM), which drives a model to extract discriminative fine-grained feature representations from only a few shots. CSCAM integrates a channel cross-attention module and a spatial cross-attention module, which compute attention across support and query samples. In addition, to suit the characteristics of fine-grained images, CSCAM includes a support averaging method that reduces the intra-class distance and increases the inter-class distance. Extensive experiments on four few-shot fine-grained classification datasets validate the effectiveness of CSCAM. Furthermore, CSCAM is a plug-and-play module that can be conveniently added to state-of-the-art methods for few-shot fine-grained image classification to improve their performance.
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work contributes to multimedia/multimodal processing by proposing a novel Channel-Spatial Cross-Attention Module (CSCAM) for few-shot fine-grained image classification. Few-shot fine-grained image classification is highly challenging because of large inter-class similarity, low intra-class similarity, and a limited number of labeled samples.
CSCAM consists of a channel cross-attention module (CCAM) and a spatial cross-attention module (SCAM). The outputs of these two modules are integrated to collaboratively refine the query representation. Specifically, CSCAM captures the channel and spatial correlations between query and support features, which makes effective use of the available feature information and reduces the model's reliance on a large number of samples.
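To make the two branches concrete, the following is a minimal NumPy sketch of channel and spatial cross-attention between one query feature map and one support feature map. It is an illustration under stated assumptions, not the authors' implementation: plain scaled dot-product attention with no learned projections is assumed, the two branch outputs are assumed to be combined by a residual sum, and all function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_tokens, support_tokens):
    """Scaled dot-product cross-attention: query tokens attend to support tokens.

    Assumption: no learned query/key/value projections are used in this sketch.
    """
    d = query_tokens.shape[-1]
    scores = softmax(query_tokens @ support_tokens.T / np.sqrt(d), axis=-1)
    return scores @ support_tokens, scores


def channel_spatial_cross_attention(query, support):
    """Refine a query feature map of shape (C, H, W) with a support map of the same shape."""
    c, h, w = query.shape
    q = query.reshape(c, h * w)
    s = support.reshape(c, h * w)

    # Channel branch: each of the C channels is a token of length H*W.
    chan_out, chan_scores = cross_attention(q, s)      # (C, HW), (C, C)

    # Spatial branch: each of the H*W positions is a token of length C.
    spat_out, spat_scores = cross_attention(q.T, s.T)  # (HW, C), (HW, HW)

    # Assumed integration step: residual sum of both branch outputs.
    refined = q + chan_out + spat_out.T
    return refined.reshape(c, h, w), chan_scores, spat_scores


rng = np.random.default_rng(0)
query_map = rng.standard_normal((8, 5, 5))
support_map = rng.standard_normal((8, 5, 5))
refined_map, chan_scores, spat_scores = channel_spatial_cross_attention(
    query_map, support_map
)
print(refined_map.shape, chan_scores.shape, spat_scores.shape)
```

In this sketch the channel branch yields a C x C score matrix and the spatial branch an HW x HW score matrix, so the two branches relate the query to the support along different axes of the same features.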
Furthermore, to account for the characteristics of fine-grained images, CSCAM employs a support averaging method that averages the attention scores computed from the support features of the different classes. This reduces the weight of areas that are similar between same-class support and query features, which decreases the intra-class distance, and increases the weight of areas that are similar between different-class support and query features, which extends the inter-class distance. As a result, the proposed method improves the performance of existing few-shot fine-grained image classification methods.
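One possible reading of the support averaging step is sketched below in NumPy. The text above does not specify the exact formulation, so everything here is an assumption made for illustration: one support prototype per class, scaled dot-product attention scores, a plain mean over the class axis, and hypothetical function and variable names.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def averaged_support_attention(query_tokens, class_support_tokens):
    """Average query-to-support attention scores over the classes.

    query_tokens:         (T, D) tokens of one query sample
    class_support_tokens: (N, T, D) tokens of one support prototype per class
    """
    d = query_tokens.shape[-1]

    # Attention scores of the query against each class's support: (N, T, T).
    logits = np.einsum("td,nsd->nts", query_tokens, class_support_tokens)
    per_class_scores = softmax(logits / np.sqrt(d), axis=-1)

    # Assumed support averaging: mean of the score maps over the class axis.
    averaged_scores = per_class_scores.mean(axis=0)  # (T, T)

    # Apply the shared, averaged map to every class's support tokens: (N, T, D).
    refined = np.einsum("ts,nsd->ntd", averaged_scores, class_support_tokens)
    return per_class_scores, averaged_scores, refined


rng = np.random.default_rng(0)
n_way, n_tokens, dim = 5, 25, 8
query_tokens = rng.standard_normal((n_tokens, dim))
class_support_tokens = rng.standard_normal((n_way, n_tokens, dim))
per_class_scores, averaged_scores, refined = averaged_support_attention(
    query_tokens, class_support_tokens
)
print(per_class_scores.shape, averaged_scores.shape, refined.shape)
```

Because each per-class score map is row-normalized, their mean is row-normalized as well, so the averaged map can be used as an ordinary attention map that is shared across classes.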
Extensive experiments on four few-shot fine-grained classification datasets validate the effectiveness of the proposed method. Moreover, the proposed module is plug-and-play, so it can be conveniently added to state-of-the-art methods for few-shot fine-grained image classification to improve their performance.
Submission Number: 1547