Global and Fine-Grained Framework for CLIP with Cross-Modal Mamba in Few-Shot Image Classification

ICLR 2026 Conference Submission 14998 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: CLIP, Multimodality, Few-Shot Learning, Mamba
Abstract: CLIP is a highly efficient cross-modal text-image embedding model with remarkable generalization ability. However, the encoders in CLIP usually operate independently without dynamic cross-modal interaction, leading to suboptimal performance in few-shot classification. We therefore propose a Global and Fine-Grained Framework for CLIP with Cross-Modal Mamba in Few-Shot Image Classification (GF4FC). Specifically, the CLIP with Cross-Modal Mamba module (CLIMA) leverages a Transformer and a Vision Transformer to encode text and images interdependently. These cross-modal representations then serve as mutual prompts to refine the embedding space, while the proposed Cross-Modal Mamba module keeps the interaction computationally efficient. Moreover, we design a Fine-Grained Capture (FGC) module that enhances CLIMA's image representations with a VSSM module extracting prior fine-grained information. Furthermore, the Local Feature Supplementation (LFS) module supplements CLIP's logits with FGC-derived fine-grained representations through a residual structure. Finally, the Adaptive Logits Fusion module dynamically fuses the logits using learned adaptive weights. Experiments on seven datasets demonstrate that GF4FC outperforms state-of-the-art methods in few-shot image classification.
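
To make the fusion stage concrete, below is a minimal PyTorch sketch of what an adaptive logits fusion with residual local-feature supplementation could look like. This is an illustrative reading of the abstract, not the authors' implementation: the module name AdaptiveLogitsFusion, the scalar weight alpha, and the residual linear projection are all assumptions introduced here.

```python
import torch
import torch.nn as nn

class AdaptiveLogitsFusion(nn.Module):
    """Hypothetical sketch: fuse CLIP logits with fine-grained (FGC-derived)
    logits via a learned adaptive weight, with a residual supplementation
    step in the spirit of the LFS module described in the abstract."""

    def __init__(self, num_classes: int):
        super().__init__()
        # Learnable scalar fusion weight, squashed to (0, 1) by a sigmoid.
        self.alpha = nn.Parameter(torch.zeros(1))
        # Residual projection supplementing the fine-grained branch
        # (an assumption; the paper's LFS structure may differ).
        self.residual = nn.Linear(num_classes, num_classes)

    def forward(self, clip_logits: torch.Tensor,
                fg_logits: torch.Tensor) -> torch.Tensor:
        # Residual structure over the fine-grained logits.
        fg = fg_logits + self.residual(fg_logits)
        # Adaptive convex combination of the two logit streams.
        w = torch.sigmoid(self.alpha)
        return w * clip_logits + (1.0 - w) * fg

# Toy usage with random logits for a 10-way few-shot task.
fusion = AdaptiveLogitsFusion(num_classes=10)
clip_logits = torch.randn(4, 10)  # batch of 4 query images
fg_logits = torch.randn(4, 10)
print(fusion(clip_logits, fg_logits).shape)  # torch.Size([4, 10])
```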
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 14998