Abstract: Prompt tuning has been successfully used to leverage the knowledge of large-scale Vision-Language Pre-trained (VLP) models on downstream tasks. Most existing prompt tuning approaches learn prompts by maximizing the pairwise similarity between matched image-text samples. Although samples from different modalities may be reasonably well aligned pair by pair, such pairwise alignment does not fully exploit the relationships among samples, and the two modalities can remain inconsistent at the modality level. In this paper, we propose a novel prompt tuning strategy that distributionally matches the two modalities. Specifically, we minimize the distribution-wise distance between the image and text modalities using optimal transport (OT) theory, and we simultaneously constrain the learned transport plan during modality matching to strengthen the learning of the vision and text prompts. As a plug-and-play method, our approach can be applied to improve existing uni-modal and multi-modal prompt learning methods and yields modality-consistent representations. Experiments on eleven public datasets demonstrate that the proposed method performs strongly, achieving substantial improvements over both uni-modal and multi-modal prompt tuning methods.
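To make the idea of distribution-wise matching concrete, the following is a minimal sketch of an entropy-regularized OT (Sinkhorn) loss between a batch of image features and a batch of text features; it is illustrative only, omits the paper's additional constraint on the transport plan, and all names (`sinkhorn_ot_loss`, `epsilon`, `n_iters`) are assumptions rather than the authors' implementation.

```python
# Illustrative sketch: entropic OT (Sinkhorn) distance between image and text
# feature distributions, usable as a modality-matching loss during prompt tuning.
import torch


def sinkhorn_ot_loss(img_feats, txt_feats, epsilon=0.1, n_iters=50):
    """Entropy-regularized OT loss between two sets of features.

    img_feats: (n, d) image-side features; txt_feats: (m, d) text-side features.
    Returns the OT loss and the learned transport plan of shape (n, m).
    """
    img = torch.nn.functional.normalize(img_feats, dim=-1)
    txt = torch.nn.functional.normalize(txt_feats, dim=-1)
    # Cost matrix: cosine distance between every image/text feature pair.
    cost = 1.0 - img @ txt.t()                            # (n, m)

    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n, device=cost.device)    # uniform marginal (images)
    nu = torch.full((m,), 1.0 / m, device=cost.device)    # uniform marginal (texts)

    K = torch.exp(-cost / epsilon)                        # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                              # Sinkhorn iterations
        v = nu / (K.t() @ u + 1e-8)
        u = mu / (K @ v + 1e-8)
    plan = torch.diag(u) @ K @ torch.diag(v)              # transport plan T (n, m)

    loss = torch.sum(plan * cost)                         # <T, C>: distribution distance
    return loss, plan


if __name__ == "__main__":
    # Hypothetical CLIP-like embeddings for a batch of 16 image-text pairs.
    img_feats = torch.randn(16, 512)
    txt_feats = torch.randn(16, 512)
    loss, plan = sinkhorn_ot_loss(img_feats, txt_feats)
    print(loss.item(), plan.shape)
```

In such a setup, the scalar OT loss could be added to the usual pairwise contrastive objective so that prompt parameters are updated to align the two modalities at the distribution level as well as pair by pair.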