Abstract: In the realm of CLIP adaptation through prompt learning, it is important to emphasize the pivotal role that proper alignment of visual and textual representations plays when adapting CLIP to downstream tasks. We argue that proper alignment for downstream tasks is determined by the $\textbf{flexibility}$ of the interaction between cross-modal information, which compensates for the absence of contrastive loss during the adaptation process. However, current prompt learning methods, which either modify the visual or language branch of CLIP in isolation or employ only uni-directional cross-modal fusion, are insufficient to exploit the full potential of the mutual interaction between the visual and textual modalities. To overcome this limitation, we propose a new paradigm for the CLIP prompt learning community, named $\textbf{B}$i$\textbf{l}$ateral Adaptive Cr$\textbf{o}$ss-Modal Fusi$\textbf{o}$n Pro$\textbf{m}$pt Learning~($\textit{Bloom}$), which introduces two enhancements. First, we propose projection functions for bi-directional modality transformation and fusion functions that encourage mutual interaction between corresponding layers of the image and text encoders. Second, we propose an adaptive mechanism that automatically searches for the optimal combination of cross-modal information at each layer. These two improvements ensure a more efficient and flexible integration of the two modalities, thereby achieving proper alignment for specific downstream tasks. We evaluate our method on base-to-novel generalization, cross-dataset transfer, and cross-domain generalization across 15 image classification datasets. The results demonstrate the significant performance gains achieved by $\textit{Bloom}$.
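To make the bilateral fusion idea concrete, the sketch below shows one way a pair of corresponding encoder layers could exchange prompt information through bi-directional projection functions and learnable fusion weights. This is a minimal illustrative sketch, not the authors' implementation; all module names, dimensions, and the sigmoid-gated fusion are assumptions.

```python
# Hypothetical sketch of bilateral adaptive cross-modal prompt fusion
# (illustrative only; names, dimensions, and gating scheme are assumptions).
import torch
import torch.nn as nn


class BilateralPromptFusion(nn.Module):
    """Fuses visual and textual prompts at one pair of corresponding encoder layers."""

    def __init__(self, dim_v: int = 768, dim_t: int = 512):
        super().__init__()
        # Bi-directional projection functions between modalities.
        self.proj_t2v = nn.Linear(dim_t, dim_v)  # text -> vision
        self.proj_v2t = nn.Linear(dim_v, dim_t)  # vision -> text
        # Learnable scalars that adaptively weight the cross-modal contribution
        # at this layer (one way to realize an adaptive fusion function).
        self.alpha_v = nn.Parameter(torch.zeros(1))
        self.alpha_t = nn.Parameter(torch.zeros(1))

    def forward(self, prompt_v: torch.Tensor, prompt_t: torch.Tensor):
        # prompt_v: (num_prompts, dim_v), prompt_t: (num_prompts, dim_t)
        gate_v = torch.sigmoid(self.alpha_v)
        gate_t = torch.sigmoid(self.alpha_t)
        fused_v = (1 - gate_v) * prompt_v + gate_v * self.proj_t2v(prompt_t)
        fused_t = (1 - gate_t) * prompt_t + gate_t * self.proj_v2t(prompt_v)
        return fused_v, fused_t


# Usage: one fusion module per layer; each layer's gates learn their own mix.
fusion = BilateralPromptFusion()
v, t = fusion(torch.randn(4, 768), torch.randn(4, 512))
```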
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion, [Content] Vision and Language
Relevance To Conference: a) We propose Bilateral Adaptive Cross-Modal Fusion Prompt Learning~(Bloom), a new prompt learning paradigm that explores flexible cross-modal interactions to attain proper alignment for specific downstream tasks.
b) We propose an adaptive cross-modal fusion function that enables Bloom to automatically search for the optimal combination of cross-modal information, further enhancing the flexibility of prompt learning.
c) Through extensive experiments, we demonstrate that Bloom significantly advances the current state of multimodal prompt learning, achieving new state-of-the-art results across 15 image classification datasets.
Submission Number: 3059