Keywords: distillation, few-shot, compression, CLIP, foundation models
Abstract: Vision-language models (VLMs) have emerged as extremely strong zero-shot and few-shot image classifiers, performing on par with task-specific models. However, they can be unnecessarily heavyweight for task-specific downstream applications. While existing lines of work have successfully compressed VLMs and other foundation models to varying degrees, most focus on preserving the generality of these models rather than leveraging their power for a particular task.
In this work, we focus on the setting in which we have a limited amount of data for a downstream image classification task and a limited inference budget.
To satisfy these constraints, we distill CLIP's strong few-shot image classification performance into a more efficient model.
We introduce the SIDCLIP (Synthesize-Initialize-Distill CLIP) method and highlight the three components that are critical to its strong performance: 1) augmenting the classifier's training data with \textit{synthetic data} generated by leveraging CLIP itself; 2) \textit{initializing} the student with a smaller CLIP model pretrained on the target architecture; and 3) incorporating \textit{knowledge distillation} so that the student mimics the larger model as closely as possible.
Our proposed strategies produce a compact model that performs within 16\% and 10\% of CLIP's linear-probe performance in the 1-shot and 8-shot settings, respectively, while using less than 2\% of the parameters of CLIP's image encoder.
We hope our work can serve as a practical guide for leveraging the power of foundation models in data-scarce, budget-constrained downstream settings.
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3944