Semantic-Aware Adaptation with Hierarchical Multimodal Prompts for Few-Shot Learning

Published: 2025 · Last Modified: 09 Nov 2025 · ICME 2025 · License: CC BY-SA 4.0
Abstract: Few-shot learning aims to recognize novel classes from limited labeled samples. Existing methods often exploit semantic information from natural language but integrate it only after visual feature extraction, overlooking fine-grained cross-modal interactions. They also struggle with spatial variations, since target objects may appear in different regions of an image. To address these limitations, we propose Semantic-Aware Adaptation (SAA), which consists of Hierarchical Multimodal Prompts (HMP) and Global-Local Adaptation (GLA). SAA leverages textual prompts encoded by CLIP and adaptively modulated by learnable visual prompts to better align the text and vision feature distributions. During visual extraction, these fused prompts are integrated with visual patches along the channel and spatial dimensions to dynamically enhance visual features, while a consistency loss regularizes them to prevent bias and overfitting. Through these deep cross-modal interactions at multiple scales, HMP improves robustness to spatial variations. In GLA, patch-level soft labels derived from rich semantics further emphasize class-specific visual patches, improving token-dependency learning. Experiments on five benchmarks demonstrate the effectiveness of SAA, with an average accuracy improvement of 1.6% on challenging 1-shot tasks.
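The abstract describes fused text-visual prompts modulating patch tokens along the channel and spatial dimensions, plus a consistency regularizer. The sketch below is a minimal, hedged illustration of that idea, not the authors' implementation: the module names, tensor shapes, sigmoid channel gate, similarity-based spatial weighting, and cosine form of the consistency loss are all assumptions made for clarity.

```python
# Illustrative sketch (assumed design, not the paper's code) of prompt-guided
# channel and spatial modulation of visual patch tokens, as outlined in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptFusionBlock(nn.Module):
    """Fuses a CLIP text embedding with learnable visual prompts, then uses the
    fused prompt to re-weight visual patch tokens channel-wise and spatially."""

    def __init__(self, dim: int = 512, num_visual_prompts: int = 4):
        super().__init__()
        # Learnable visual prompts that adaptively modulate the text prompt (assumed).
        self.visual_prompts = nn.Parameter(torch.randn(num_visual_prompts, dim) * 0.02)
        self.fuse = nn.Linear(2 * dim, dim)
        # Channel-wise gate and spatial projection heads (assumed components).
        self.channel_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.spatial_proj = nn.Linear(dim, dim)

    def forward(self, patches: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D) visual patch tokens; text_emb: (B, D) CLIP text feature.
        prompt = self.visual_prompts.mean(dim=0)                       # (D,)
        fused = self.fuse(torch.cat([text_emb, prompt.expand_as(text_emb)], dim=-1))  # (B, D)

        # Channel modulation: gate each feature channel of every patch.
        gate = self.channel_gate(fused).unsqueeze(1)                    # (B, 1, D)
        patches = patches * gate

        # Spatial modulation: weight patches by similarity to the fused prompt.
        query = self.spatial_proj(fused).unsqueeze(-1)                  # (B, D, 1)
        attn = F.softmax(patches @ query / patches.shape[-1] ** 0.5, dim=1)  # (B, N, 1)
        return patches + patches * attn                                 # residual enhancement


def consistency_loss(fused_prompt: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """One plausible form of the consistency regularizer: keep the modulated
    prompt close to the frozen CLIP text embedding."""
    return 1.0 - F.cosine_similarity(fused_prompt, text_emb, dim=-1).mean()


if __name__ == "__main__":
    block = PromptFusionBlock(dim=512)
    patches = torch.randn(2, 196, 512)   # e.g. 14x14 ViT patch tokens
    text_emb = torch.randn(2, 512)       # CLIP-encoded class-name prompt
    print(block(patches, text_emb).shape)  # torch.Size([2, 196, 512])
```

In the paper's hierarchical setup, a block of this kind would presumably be applied at several stages of the visual encoder, which is how multi-scale cross-modal interaction would arise; the single-stage version above is only meant to make the channel/spatial fusion concrete.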