MUOT-CLIP: Enhancing Few-Shot Adaptation of CLIP via Inter- and Intra-Modality Unbalanced Optimal Transport

16 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Vision-Language Models, Few-Shot Classification, Prompt Learning
TL;DR: Enhancing the few-shot adaptation performance of CLIP via inter- and intra-modality unbalanced optimal transport.
Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated remarkable zero-shot capabilities across a variety of domains. To enhance its performance in data-scarce settings, few-shot adaptation methods have been developed. Beyond fine-tuning parameters (e.g., adapter-based approaches), prompt learning methods learn suitable prompts that minimize the distance between visual and textual features. Optimal Transport (OT) has proven highly effective as a distance metric over CLIP's feature space. However, classical OT, which enforces equality constraints on both the source and target marginals of the transport plan, is susceptible to noise (e.g., misleading local regions in images and unrelated words in prompts). Furthermore, both adapter-based and prompt learning methods usually overlook the modality gap in the feature space and thus risk suboptimal performance. In this paper, we extend classical OT to unbalanced optimal transport (UOT) for more robust measurement: the UOT-based distance adaptively filters out noise. To boost few-shot adaptation performance, we propose a framework that measures both the inter- and intra-**M**odality distance based on **UOT** for **CLIP**, termed **MUOT-CLIP**. In addition, a scalable UOT solver with an entropy regularization term is used for efficient optimization of the model. Compared with state-of-the-art methods, MUOT-CLIP consistently exhibits favorable performance on few-shot classification benchmarks spanning 11 datasets.
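To make the relaxation concrete: a standard way to solve entropy-regularized UOT is generalized Sinkhorn scaling, where the hard marginal constraints of classical OT are replaced by KL penalties (in the style of Chizat et al., 2018). The sketch below is a minimal NumPy illustration of that generic scheme, not the paper's implementation; the function name `sinkhorn_uot`, the hyperparameter values, and the uniform marginals are all illustrative assumptions.

```python
import numpy as np

def sinkhorn_uot(C, a, b, eps=0.05, tau=1.0, n_iters=200):
    """Illustrative entropy-regularized UOT solver (generalized Sinkhorn).

    Approximately solves
        min_{P >= 0}  <C, P> - eps * H(P)
                      + tau * KL(P @ 1  || a)
                      + tau * KL(P.T @ 1 || b),
    where the KL terms relax the exact marginal constraints of classical
    OT, so mass on noisy rows/columns can be down-weighted rather than
    forcibly transported.
    """
    K = np.exp(-C / eps)                  # Gibbs kernel from the cost matrix
    u, v = np.ones_like(a), np.ones_like(b)
    fi = tau / (tau + eps)                # damped exponent from the KL relaxation
    for _ in range(n_iters):
        u = (a / (K @ v)) ** fi           # row scaling update
        v = (b / (K.T @ u)) ** fi         # column scaling update
    return u[:, None] * K * v[None, :]    # transport plan diag(u) K diag(v)
```

A hypothetical usage, matching the abstract's setting of matching local image regions against prompt tokens (all shapes and data here are made up for illustration):

```python
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 512))      # e.g., 7x7 grid of visual features
tokens = rng.normal(size=(16, 512))       # e.g., prompt token features
patches /= np.linalg.norm(patches, axis=1, keepdims=True)
tokens /= np.linalg.norm(tokens, axis=1, keepdims=True)
C = 1.0 - patches @ tokens.T              # cosine cost between the two sets
a, b = np.full(49, 1 / 49), np.full(16, 1 / 16)
P = sinkhorn_uot(C, a, b)
uot_distance = (C * P).sum()              # UOT distance used as a matching score
```

With tau → ∞ the exponent `fi` → 1 and the updates reduce to ordinary entropic Sinkhorn with exact marginals; finite tau is what lets the plan shed mass on misleading regions or unrelated words.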
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7191