Fine-Tuning of CLIP in Few-Shot Scenarios via Supervised Contrastive Learning

Published: 01 Jan 2024, Last Modified: 27 Jul 2025 · PRCV (3) 2024 · CC BY-SA 4.0
Abstract: Large-scale pretrained vision-language models such as CLIP have proven highly effective at learning universal representations and have achieved significant success across various downstream tasks. Recently, there has been growing interest in fine-tuning these large models with limited data. To alleviate overfitting during fine-tuning, existing methods usually freeze the parameters of CLIP pretrained on large-scale datasets and use the features extracted by the CLIP model for downstream tasks. However, such a fine-tuning strategy may limit performance, because semantic visual features specific to a downstream task may not be well extracted by CLIP's frozen feature extractor. In this study, we propose an effective framework to fine-tune CLIP with few-shot samples while alleviating overfitting. In this framework, a visual adapter is embedded at the end of CLIP's visual encoder to encourage the model to effectively extract semantic features relevant to the downstream task, and a supervised contrastive loss is introduced to alleviate overfitting by guiding the optimization to focus on the adapter. In addition, the multimodal feature alignment capability of CLIP is leveraged to direct the adapted visual encoder, through textual prompts, toward extracting class-relevant image features. Experimental evaluations on 11 datasets confirm the superior performance of the proposed approach.
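
The sketch below is a minimal PyTorch illustration of the ingredients named in the abstract: a small residual adapter appended to CLIP's frozen visual encoder, a supervised contrastive loss over the adapted features, and alignment against text-prompt class prototypes. It is an assumption-laden sketch, not the authors' released implementation: the module name VisualAdapter, the bottleneck width, the residual ratio, the prompt template, the temperature, the logit scale, and the loss weighting lam are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI CLIP package


class VisualAdapter(nn.Module):
    """Bottleneck adapter with a residual mix, appended after the visual encoder.
    The hidden width and residual ratio are illustrative assumptions."""
    def __init__(self, dim, hidden=256, residual_ratio=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, dim)
        )
        self.ratio = residual_ratio

    def forward(self, x):
        return self.ratio * self.net(x) + (1.0 - self.ratio) * x


def supervised_contrastive_loss(features, labels, temperature=0.07):
    """SupCon-style loss: pull together adapted features that share a class label.
    `features` are assumed L2-normalized, shape (N, D); `labels` shape (N,)."""
    n = features.size(0)
    sim = features @ features.t() / temperature
    self_mask = torch.eye(n, dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, -1e9)          # exclude self-pairs
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0                           # anchors with >= 1 positive
    mean_log_prob_pos = (log_prob * pos_mask.float()).sum(1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()


device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
for p in model.parameters():                         # freeze the pretrained backbone
    p.requires_grad_(False)

classnames = ["cat", "dog"]                          # placeholder few-shot classes
prompts = clip.tokenize([f"a photo of a {c}" for c in classnames]).to(device)
with torch.no_grad():
    text_feat = F.normalize(model.encode_text(prompts).float(), dim=-1)

adapter = VisualAdapter(text_feat.size(-1)).to(device)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)


def training_step(images, labels, lam=0.5, logit_scale=100.0):
    """One few-shot step: only the adapter receives gradients."""
    with torch.no_grad():
        img_feat = model.encode_image(images.to(device)).float()
    adapted = F.normalize(adapter(img_feat), dim=-1)
    logits = logit_scale * adapted @ text_feat.t()   # alignment with text prototypes
    loss = F.cross_entropy(logits, labels.to(device)) \
        + lam * supervised_contrastive_loss(adapted, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the backbone stays frozen and only the lightweight adapter is trained, the supervised contrastive term concentrates the optimization on the adapter, which is the overfitting-mitigation idea the abstract describes.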