Keywords: Prompt tuning, Vision-language models, Contrastive learning
Abstract: Prompt tuning, which learns continuous text prompts for adapting large vision-language models, has attracted much attention in recent years. While prior works show promising performance over hand-crafted prompts, they typically learn prompts with a cross-entropy loss, which limits their generalization capability in many real-world scenarios. Motivated by the effectiveness of contrastive learning for improved generalization, we introduce Contrastive Prompt Tuning (CPT), a simple yet efficient framework that explicitly optimizes the learned prompts to be consistent with the image space. In particular, combined with the cross-entropy loss, our contrastive losses help learn prompts such that the model makes consistent predictions across different views of an image while also maintaining the consistency of pairwise similarities among different images. Extensive experiments on a battery of datasets demonstrate that our proposed method significantly outperforms existing methods in improving the model's generalization, while also achieving consistent improvements in few-shot in-domain performance for a wide variety of vision-language models.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
TL;DR: We introduce contrastive prompt tuning for improved generalization in vision-language models by optimizing for the learned prompts to be consistent with the image space.
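The abstract describes the objective only in words: a cross-entropy loss combined with one term that encourages consistent predictions across views of an image and another that preserves pairwise similarities among images. The following is a minimal sketch of how such a combined objective might look, not the authors' formulation. The symmetric KL form of the view term, the squared-difference form of the pairwise term, the names `combined_loss`, `w_view`, and `w_pair`, and the use of plain Python lists are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def combined_loss(logits_a, logits_b, image_feats, labels, w_view=1.0, w_pair=1.0):
    """Hypothetical combined objective.

    logits_a, logits_b: per-image class logits for two augmented views.
    image_feats: per-image feature vectors from the image encoder.
    labels: true class index per image.
    w_view, w_pair: hypothetical weights, not taken from the paper.
    """
    n = len(labels)
    probs_a = [softmax(z) for z in logits_a]
    probs_b = [softmax(z) for z in logits_b]

    # Supervised term: cross-entropy on the first view.
    ce = sum(-math.log(p[y]) for p, y in zip(probs_a, labels)) / n

    # View-consistency term (assumed form): symmetric KL between
    # the predictions for the two views of each image.
    view = sum(
        0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
        for p, q in zip(probs_a, probs_b)
    ) / n

    # Pairwise term (assumed form): the similarity between the predictions
    # for images i and j should match the similarity of their image features.
    pair, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = cosine(probs_a[i], probs_a[j]) - cosine(image_feats[i], image_feats[j])
            pair += diff * diff
            count += 1
    pair = pair / count if count else 0.0

    return ce + w_view * view + w_pair * pair
```

In this sketch, passing the same logits for both views makes the view term vanish, so the loss reduces to cross-entropy plus the pairwise term. A real implementation would operate on batched tensors and backpropagate only into the prompt parameters.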