Don't Pre-train, Teach Your Small Model

21 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Small models, supervised learning, pre-training, finetuning, knowledge distillation, low cost, synthetic dataset, contrastive learning, diffusion models
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: For small models, instead of pre-training, we can distill knowledge from a public pre-trained teacher on the task dataset augmented with synthetic samples, achieving better accuracy than pre-training followed by finetuning, at a much lower training cost.
Abstract: In this paper, we reconsider the question: what is the most effective way to train a small model? A standard approach is to train it from scratch in a supervised manner on the desired task, yielding satisfactory results at a low cost. Alternatively, one can first pre-train it on a large foundation dataset and then finetune it on the downstream task to obtain strong performance, albeit at a much higher total training cost. Is there a middle way that balances high performance with low resource use? We find the answer to be yes. If, while training from scratch, we regularize the feature backbone (and optionally the task-specific head) to match an existing pre-trained one on the relevant subset of the data manifold, a small model can achieve similar or better performance than if it were fully pre-trained and finetuned. We achieve this via a novel knowledge distillation loss based on the Alignment/Uniformity theory of contrastive learning by Wang & Isola (2020), which we use to transfer the knowledge of the task dataset augmented with synthetic inputs generated by existing pre-trained diffusion models. Across 6 image recognition datasets, using pre-trained convolution- and attention-based teachers from public model hubs, we show significant improvements in small-model performance at only slightly higher cost than supervised training from scratch. Since our method holds its own against, and often surpasses, the pre-training regime, we refer to our paradigm as: Don’t Pre-train, Teach (DPT).
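For intuition only, below is a minimal PyTorch sketch of what an alignment/uniformity-style distillation objective in the spirit of Wang & Isola (2020) could look like when adapted to teacher-student feature matching. The function names, the weights `w_align` and `w_unif`, and the choice to L2-normalize features are illustrative assumptions, not the paper's exact loss.

```python
# Sketch of an alignment/uniformity-style distillation objective (assumptions noted below).
import torch
import torch.nn.functional as F

def alignment_loss(student_feats, teacher_feats, alpha=2):
    # Pull each student feature toward the frozen teacher's feature for the same input.
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    return (s - t).norm(p=2, dim=-1).pow(alpha).mean()

def uniformity_loss(student_feats, t=2.0):
    # Encourage student features to spread out on the unit hypersphere.
    s = F.normalize(student_feats, dim=-1)
    sq_dists = torch.pdist(s, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()

def dpt_style_loss(logits, labels, student_feats, teacher_feats,
                   w_align=1.0, w_unif=1.0):
    # Supervised task loss regularized so the student backbone matches the
    # pre-trained teacher on the (real + synthetic) task data; the specific
    # weighting scheme here is a hypothetical choice, not the authors' recipe.
    ce = F.cross_entropy(logits, labels)
    return (ce
            + w_align * alignment_loss(student_feats, teacher_feats)
            + w_unif * uniformity_loss(student_feats))
```

In a setting like the one described, the cross-entropy term might apply only to labeled real samples while the feature-matching terms also cover unlabeled synthetic inputs from the diffusion model; that split is an assumption of this sketch.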
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4058