KNN Transformer with Pyramid Prompts for Few-Shot Learning

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Poster, CC BY 4.0
Abstract: Few-Shot Learning (FSL) aims to recognize new classes from limited labeled data. Recent studies have attempted to address the challenge of rare samples by using textual prompts to modulate visual features. However, they usually struggle to capture the complex semantic relationships between textual and visual features. Moreover, vanilla self-attention is heavily affected by useless information in images, severely constraining the potential of semantic priors in FSL because numerous irrelevant tokens confound the interaction. To address these issues, a K-NN Transformer with Pyramid Prompts (KTPP) is proposed to select discriminative information with K-NN Context Attention (KCA) and adaptively modulate visual features with Pyramid Cross-modal Prompts (PCP). First, for each token, the KCA selects only the K most relevant tokens to compute the self-attention matrix and incorporates the mean of all tokens as a context prompt to provide global context, applied in three cascaded stages so that irrelevant tokens are progressively suppressed. Second, pyramid prompts are introduced in the PCP to emphasize visual features via interactions between text-based class-aware prompts and multi-scale visual features. This allows the ViT to dynamically adjust the importance weights of visual features based on rich semantic information at different scales, making the model robust to spatial variations. Finally, the augmented visual features and class-aware prompts interact via the KCA to extract class-specific features. Consequently, our model further enhances noise-free visual representations through deep cross-modal interactions, extracting generalized visual representations in scenarios with few labeled samples. Extensive experiments on four benchmark datasets demonstrate significant gains over state-of-the-art methods, especially on the 1-shot task with a 2.28% improvement on average, owing to semantically enhanced visual representations.
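The KCA mechanism described above can be sketched as follows. This is a minimal single-head illustration, not the authors' implementation: learned query/key/value projections, multi-head structure, and the three cascaded stages are omitted, and the function name and shapes are assumptions. It shows the two core ideas from the abstract: per-query top-K selection of token scores, and a context prompt formed from the mean of all tokens.

```python
import numpy as np

def knn_context_attention(x, k):
    """Sketch of K-NN Context Attention (KCA).

    x: (n, d) array of token embeddings.
    For each query token, only its K highest-scoring tokens contribute
    to attention; a context prompt (the mean of all tokens) is appended
    as an extra key/value so every query retains global context.
    """
    n, d = x.shape
    ctx = x.mean(axis=0, keepdims=True)            # context prompt, (1, d)
    keys = np.vstack([x, ctx])                     # (n + 1, d)
    scores = x @ keys.T / np.sqrt(d)               # (n, n + 1)

    # Keep only the top-k token scores per query; mask the rest.
    # The context-prompt column is always kept.
    token_scores = scores[:, :n]
    kth = np.partition(token_scores, -k, axis=1)[:, -k][:, None]
    token_scores = np.where(token_scores < kth, -np.inf, token_scores)
    scores = np.concatenate([token_scores, scores[:, n:]], axis=1)

    # Softmax over the surviving scores, then aggregate values.
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ keys                                # (n, d)
```

Masking with `-np.inf` before the softmax zeroes out the suppressed tokens exactly, which is how sparse top-K attention is typically realized; the appended mean token guarantees that even aggressively pruned queries still see a summary of the whole image.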
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work contributes to multimodal processing by proposing a K-NN Transformer with Pyramid Prompts (KTPP) that effectively integrates the visual and textual modalities for Few-Shot Learning (FSL). The core of KTPP consists of K-NN Context Attention (KCA) and Pyramid Cross-modal Prompts (PCP). The KCA selectively focuses on the most relevant tokens when computing the attention matrix in the visual modality, reducing noise from irrelevant data, which is crucial for understanding multimedia content. The PCP builds pyramid prompts that enhance visual features via cross-modal attention between textual embeddings and multi-scale visual features. These cross-modal interactions fully exploit information across modalities, enabling the ViT to dynamically adjust the importance of visual features based on semantics and to improve adaptability to spatial variations. The enhanced visual features then interact with class-aware prompts along the spatial dimension to highlight semantically critical tokens. Ultimately, the ViT establishes a strong correlation between the modalities. By bridging the inherent distribution gaps between the visual and textual modalities and fully leveraging multimodal complementarity, KTPP offers a robust solution for comprehending multimodal information, showing significant improvements on FSL tasks.
Supplementary Material: zip
Submission Number: 1196