ADAPT: Adaptive Prompt Tuning for Pre-Trained Vision-Language Models

TMLR Paper6927 Authors

08 Jan 2026 (modified: 25 Jan 2026) · Under review for TMLR · CC BY 4.0
Abstract: Prompt tuning has emerged as an effective approach to parameter-efficient fine-tuning. Conventional deep prompt tuning inserts continuous prompts of a fixed context length into the input of each layer. When a pre-trained model is tailored to a specific downstream task, different layers initialized with pre-trained weights may deviate from the optimal weights to different degrees, so prompts with a fixed context length may carry redundant context tokens or provide insufficient context. To address this issue, we propose a deep continuous prompting method, dubbed Adapt, that encourages heterogeneous context lengths. Context lengths are determined automatically by iteratively pruning context tokens: we use a saliency criterion from neural network pruning to compute importance scores of context tokens and decide which tokens to prune. To mitigate forgetting during fine-tuning, we apply angular knowledge distillation, which forces the model to match the angular separation between pairs of classes, and between instances, learned by the pre-trained model. We evaluate the proposed method on the pre-trained vision-language model CLIP. 16-shot experiments on 11 downstream datasets reveal the advantage of Adapt: it achieves competitive average test accuracy, with a performance gain of up to 7.44% on individual datasets. We release the code at https://anonymous.4open.science/r/Adapt-Prompt-Release.
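The saliency-driven token pruning described in the abstract could be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's released implementation: the function names, the first-order Taylor form of the saliency score (|gradient · token|), and the numpy array shapes are all our assumptions.

```python
import numpy as np

def saliency_scores(prompt, grad):
    # First-order Taylor saliency per context token (an assumed criterion):
    # score_i = |g_i . p_i|, where p_i is the i-th prompt token embedding
    # and g_i is the gradient of the loss w.r.t. that token.
    # prompt, grad: arrays of shape (num_tokens, dim)
    return np.abs(np.sum(prompt * grad, axis=1))

def prune_tokens(prompt, grad, num_prune):
    # Drop the num_prune lowest-saliency context tokens, keeping the
    # remaining tokens in their original order.
    scores = saliency_scores(prompt, grad)
    keep = np.sort(np.argsort(scores)[num_prune:])
    return prompt[keep]
```

In a heterogeneous-length scheme such as the one the paper proposes, this pruning step would be applied per layer, so each layer ends up with its own context length.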
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=fRJLduRxrz&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DTMLR%2FAuthors%23your-submissions)
Changes Since Last Submission: We made the following changes since the last submission:

1. We added to Figure 4 the performance of traditional knowledge distillation (KL divergence loss), which boosts the performance of the heterogeneous prompt configuration.
2. We added a "Limitations" section explaining why the proposed method requires a number of shots that is not very small in order to obtain good performance.
3. We added a statistical comparison between our method and the best baseline. Specifically, we conducted a Wilcoxon signed-rank test; the result indicates that our method statistically outperforms the best baseline.
4. We added clarifications to avoid potential confusion. For example, we emphasized that the efficiency improvement from pruning is based on comparing our method without pruning versus our method with pruning. We also added the following paragraph to note that we applied a knowledge distillation approach originally proposed in PromptSRC:
```
PromptSRC (Khattak et al., 2023) proposes a self-regulation loss on image and text embeddings for prompt tuning. They test different types of self-regulating losses and conclude that the L1 distance between the embeddings of the student model and the teacher model performs best in their setting. Following their setting, we test both cosine similarity and L1 distance, and find that cosine similarity performs better in our setting. We denote this self-regulation loss as the angular knowledge distillation loss.
```

Below is the one-to-one response to the Editor's comments:

1. Effect of angular KD: Knowledge distillation is a common approach in parameter-efficient fine-tuning, as forgetting can be severe during fine-tuning. Hence, we applied knowledge distillation to boost performance.
The newly added Wilcoxon signed-rank test indicates a statistically significant performance difference between our method and PromptSRC (the best baseline). When applying traditional knowledge distillation (KL divergence loss), we still observe a pronounced performance boost over pure heterogeneous pruning; this result has been added to the updated Figure 4.

2. Performance gain ensured by statistical significance testing: We conducted a Wilcoxon signed-rank test showing that our method statistically outperforms the best baseline.

3. Comparing performance with 16 shots: Our method is essentially a bi-level optimization, which can underperform when the dataset is very limited (e.g., when the number of shots is very small). We added a Limitations section to acknowledge this.

4. Ablation on "with context pruning" vs. "without context pruning": In the newly added Figure 14, we compare performance with and without context pruning. The result indicates that context pruning improves performance.

5. We added a paragraph (quoted under change 4 above) noting that the knowledge distillation approach we applied was originally proposed in PromptSRC.
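The cosine-similarity self-regulation loss that the response calls "angular knowledge distillation" could look roughly like the following. This is a hedged sketch under our own assumptions (per-row cosine similarity between student and frozen-teacher embeddings, averaged as 1 − cos); the actual loss in the paper may differ in weighting or scope.

```python
import numpy as np

def angular_kd_loss(student_emb, teacher_emb):
    # Cosine-similarity self-regulation loss (illustrative sketch):
    # normalize each embedding row, then penalize the mean angular
    # deviation 1 - cos(s_i, t_i) of the student from the frozen teacher.
    # student_emb, teacher_emb: arrays of shape (batch, dim)
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))
```

The loss is zero when student and teacher embeddings point in the same directions and grows as they diverge angularly, which matches the stated goal of preserving the pre-trained model's angular structure during fine-tuning.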
Assigned Action Editor: ~Yingnian_Wu1
Submission Number: 6927