Bootstrap Prompt Learning with Feature Adaptation for Vision-Language Efficient Tuning

ICLR 2026 Conference Submission 18809 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vision-language foundation model, parameter-efficient tuning, prompt learning, adapter tuning
Abstract: Prompt learning is widely adopted for fine-tuning vision-language foundation models such as CLIP and offers strong generalization ability by inserting learnable embeddings into the input space for pre-adjustment. However, existing methods usually suffer from limited fitting capacity and rely heavily on a biased, exclusive cross-entropy loss that compromises generalization to unseen classes. To address these problems, we propose ada\textbf{P}ter bootstr\textbf{A}pped prompt contrastive \textbf{T}uning (PAT), the first framework to integrate the superior fitting capacity of post-adjustment via adapters into prompt learning. Specifically, we bootstrap prompt learning with adapters and enforce pre-post alignment, achieving a more effective trade-off between fitting capacity and generalization ability. Furthermore, we propose a tolerance regularization that pushes away all negative samples equally and improves generalization by introducing additional categories of unlabeled data to avoid overfitting. To the best of our knowledge, this is the first successful attempt to simultaneously exploit the advantages of prompt learning and adapter tuning. Extensive evaluations demonstrate that PAT achieves state-of-the-art performance on various recognition tasks across three prevailing benchmarks.
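The abstract names three ingredients: learnable prompt embeddings inserted in the input space (pre-adjustment), adapters applied to encoded features (post-adjustment), and a tolerance regularization that pushes all negatives away equally. A minimal PyTorch sketch of how such pieces could fit together is below; `PromptedEncoder`, `BottleneckAdapter`, and `tolerance_regularizer` are hypothetical names and illustrative implementations under my own assumptions, not the paper's actual PAT code.

```python
# Illustrative sketch only: the paper's real PAT implementation may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedEncoder(nn.Module):
    """Prepends learnable prompt tokens to token embeddings (pre-adjustment)."""
    def __init__(self, embed_dim=512, n_prompts=4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)

    def forward(self, token_embeds):                  # (B, T, D)
        B = token_embeds.size(0)
        p = self.prompts.unsqueeze(0).expand(B, -1, -1)
        return torch.cat([p, token_embeds], dim=1)    # (B, P+T, D)

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter on encoded features (post-adjustment)."""
    def __init__(self, dim=512, reduction=4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)

    def forward(self, feats):                         # (B, D)
        return feats + self.up(F.relu(self.down(feats)))

def tolerance_regularizer(logits, labels):
    """One way to 'push away all negatives equally': penalize the KL divergence
    between the distribution over negative classes and a uniform distribution."""
    B, C = logits.shape
    neg_mask = ~F.one_hot(labels, C).bool()
    neg_logits = logits[neg_mask].view(B, C - 1)
    neg_probs = neg_logits.softmax(dim=-1)
    uniform = torch.full_like(neg_probs, 1.0 / (C - 1))
    return F.kl_div(neg_probs.log(), uniform, reduction="batchmean")

# Toy usage with CLIP-style cosine logits over 10 class text embeddings.
prompted = PromptedEncoder()(torch.randn(8, 7, 512))  # (8, 11, 512) prompted tokens
img_feats = F.normalize(torch.randn(8, 512), dim=-1)
txt_feats = F.normalize(torch.randn(10, 512), dim=-1)
adapter = BottleneckAdapter()
logits = 100.0 * F.normalize(adapter(img_feats), dim=-1) @ txt_feats.t()
labels = torch.randint(0, 10, (8,))
loss = F.cross_entropy(logits, labels) + 0.1 * tolerance_regularizer(logits, labels)
loss.backward()
```

In this reading, the cross-entropy term fits the labeled classes while the uniform-negatives term tempers the "exclusive" push on any single negative class; the 0.1 weight is an arbitrary placeholder.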
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18809