Abstract: Vision-language models (VLMs) such as CLIP have emerged as extremely strong zero-shot and few-shot image classifiers.
However, these models are often too expensive or cumbersome to deploy in resource-constrained downstream applications.
In this work, we examine how to best leverage the strength of pretrained VLMs: by extracting $\textit{task-specific}$ information to obtain a small model that can be deployed in a very specific, low-resource setting.
We present the SIDCLIP method, a novel training pipeline which drastically improves the performance of small, efficient models, such as EfficientNet B0.
The pipeline includes three components that are critical to obtaining strong performance: 1) augmenting the classifier's training set with $\textit{synthetic data}$ generated by leveraging CLIP itself; 2) $\textit{initializing}$ training from a smaller CLIP model pretrained with the target architecture; and 3) incorporating $\textit{knowledge distillation}$ so the small model maximally mimics the larger model.
SIDCLIP improves the performance of an EfficientNet B0 model by an average of $50\%$ on 1-shot versions of four datasets and by an average of $26\%$ on the 8-shot versions, relative to directly trained networks. It also approaches CLIP's linear-probe performance while using a model with fewer than $2\%$ of the parameters of CLIP ViT-L/14's image encoder.
We hope our work can serve as a practical guide for leveraging the power of foundation models in downstream data-scarce and budget-constrained settings.
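As a concrete illustration of the knowledge-distillation component, below is a minimal NumPy sketch of a standard Hinton-style distillation objective, in which a small student is trained to match the temperature-softened outputs of a large teacher. The temperature $T$, mixing weight $\alpha$, and exact loss used by SIDCLIP are not specified in the abstract; they are assumptions here.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T yields a softer distribution.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-target term: KL(teacher || student) at temperature T,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    # Hard-target term: ordinary cross-entropy against ground-truth labels.
    p_hard = softmax(student_logits)
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(alpha * (T ** 2) * kl + (1 - alpha) * ce))
```

A student whose logits match the teacher's incurs only the (small) hard-label term, while a mismatched student pays an additional KL penalty; in practice both logit sets would come from the small model and the large CLIP-based classifier, respectively.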
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Massimiliano_Mancini1
Submission Number: 7341