Keywords: pre-training, transfer learning, data curation, CLIP, supervised learning, self-supervised learning, LAION
TL;DR: We explore how the distribution, size, source, and curation method of pre-training data affect transfer performance.
Abstract: We explore which pre-training dataset should be used to achieve the best transfer learning performance. We investigate the impact of pre-training on few-shot and full fine-tuning performance using 7 pre-training datasets and 9 downstream datasets. Through extensive controlled experiments, we find that the choice of pre-training dataset is essential for few-shot transfer, but its role diminishes as more data becomes available for fine-tuning. Additionally, we explore the role of data curation and examine the trade-offs between label noise and the size of the pre-training dataset. We find that using 2000× more pre-training data from LAION can match the performance of supervised ImageNet pre-training.
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/the-role-of-pre-training-data-in-transfer/code)