Fully fine-tuned CLIP models are efficient few-shot learners

Published: 01 Jan 2025, Last Modified: 15 Jul 2025 · Knowledge-Based Systems 2025 · CC BY-SA 4.0
Abstract: Prompt tuning, which trains only a small set of parameters, effectively adapts pre-trained Vision-Language Models (VLMs) to downstream tasks. However, this approach often comes at the cost of flexibility and adaptability when the tuned models are applied to different datasets or domains. In this paper, we revisit vanilla full fine-tuning for VLMs and show that it is more efficient than prompt tuning under data-limited scenarios. To mitigate the overfitting and catastrophic forgetting that arise when fine-tuning an entire VLM for a specific task under limited supervision, we propose a framework named CLIP-CITE that designs a discriminative visual-text task, further aligns visual-text semantics in a supervised manner, and integrates knowledge distillation to preserve previously acquired knowledge. Extensive experiments under few-shot learning, base-to-new generalization, domain generalization, and cross-domain generalization settings demonstrate that our method effectively improves performance on specific tasks under limited supervision while preserving the versatility of the VLMs on other datasets.
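To make the high-level recipe concrete, below is a minimal sketch of one training step that combines a discriminative image-to-text classification loss with a distillation term against the frozen pre-trained CLIP encoder. This is an illustrative assumption about how such a combination could be wired up, not the paper's CLIP-CITE implementation: the `model`/`frozen_model` objects, the cosine-similarity distillation term, the `labels` tensor, and the `distill_weight` and `temperature` values are all hypothetical, and the actual losses in CLIP-CITE may differ.

```python
import torch
import torch.nn.functional as F

def fine_tune_step(model, frozen_model, images, text_tokens, labels,
                   optimizer, distill_weight=1.0, temperature=0.01):
    """One hypothetical full fine-tuning step: a discriminative image-to-class
    loss plus a distillation term that keeps the tuned image encoder close to
    the frozen pre-trained CLIP, limiting catastrophic forgetting."""
    # Encode images and class-prompt texts with the trainable model.
    img_feat = F.normalize(model.encode_image(images), dim=-1)
    txt_feat = F.normalize(model.encode_text(text_tokens), dim=-1)

    # Discriminative task: classify each image against the class prompts.
    logits = img_feat @ txt_feat.t() / temperature
    cls_loss = F.cross_entropy(logits, labels)

    # Distillation: pull image features toward the frozen teacher's features.
    with torch.no_grad():
        teacher_feat = F.normalize(frozen_model.encode_image(images), dim=-1)
    distill_loss = (1.0 - (img_feat * teacher_feat).sum(dim=-1)).mean()

    loss = cls_loss + distill_weight * distill_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The distillation term here anchors the fine-tuned encoder to the pre-trained feature space, which is one common way to retain zero-shot versatility while the whole model is updated for the target task under few-shot supervision.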