Prompt Tuning with Prompt-aligned Gradient for Vision-Language Models

Published: 01 Feb 2023, Last Modified: 13 Feb 2023, Submitted to ICLR 2023
Keywords: prompt tuning, vision-language models, CLIP
Abstract: Thanks to large pre-trained vision-language models (VLMs) like CLIP, we can craft a zero-shot classifier by ``prompt'', e.g., using the model-provided similarity between an image and the prompt sentence ``$\texttt{a photo of a [CLASS]}$'' as the confidence score for predicting that the image is ``$\texttt{[CLASS]}$''. Therefore, prompts show great potential for quickly adapting VLMs to downstream tasks if we fine-tune the prompt-based similarity measure. However, we find a common failure: improper fine-tuning may undermine the prompt's inherent predictions not only for the task-related classes but also for other classes in the VLM vocabulary. Existing methods still address this problem with traditional anti-overfitting techniques such as early stopping and data augmentation, which lack a principled solution specific to prompts. We present Prompt-aligned Gradient, dubbed $\texttt{ProGrad}$, to prevent prompt tuning from forgetting the general knowledge learned from VLMs. In particular, $\texttt{ProGrad}$ only updates the prompt whose gradient is aligned with (or non-conflicting with) the ``general direction'', which is represented as the gradient of the KL loss of the pre-defined prompt prediction. Extensive experiments demonstrate the stronger few-shot generalization ability of $\texttt{ProGrad}$ over state-of-the-art prompt tuning methods. Code is in the Appendix.
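To make the "aligned (or non-conflicting) gradient" idea concrete, below is a minimal sketch of a prompt update rule consistent with the abstract. It assumes two already-computed gradients of the prompt context vectors: the few-shot task gradient and the gradient of the KL loss toward the zero-shot (hand-crafted prompt) prediction, and it assumes a PCGrad-style projection for the conflicting case; the function name, shapes, and exact projection rule are illustrative, not the authors' released implementation.

```python
import torch


def prompt_aligned_update(grad_task: torch.Tensor,
                          grad_general: torch.Tensor,
                          lr: float = 0.01) -> torch.Tensor:
    """Return an update for the prompt parameters (hypothetical helper).

    grad_task    : gradient of the downstream few-shot loss w.r.t. the prompt.
    grad_general : gradient of the KL loss between the tuned prediction and the
                   zero-shot prediction of the pre-defined prompt, i.e. the
                   "general direction" in the abstract.
    """
    dot = torch.dot(grad_task.flatten(), grad_general.flatten())
    if dot >= 0:
        # Gradients agree: the task update does not conflict with the
        # general knowledge, so use it as-is.
        update = grad_task
    else:
        # Conflict: keep only the component of the task gradient that does not
        # oppose the general direction (assumed PCGrad-style projection).
        update = grad_task - (dot / grad_general.flatten().pow(2).sum()) * grad_general
    return -lr * update


# Toy usage on a flattened prompt-context vector (e.g., 16 tokens of dim 512).
prompt = torch.randn(16 * 512)
g_task = torch.randn_like(prompt)      # stand-in for the cross-entropy gradient
g_general = torch.randn_like(prompt)   # stand-in for the KL-loss gradient
prompt = prompt + prompt_aligned_update(g_task, g_general)
```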
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
TL;DR: We present Prompt-aligned Gradient to prevent prompt tuning from forgetting the general knowledge learned from CLIP.
Supplementary Material: zip