['1c1', '< Title: On the Connection between Pre-training Data Diversity and Fine-tuning Robustness', '---', '> Title: Pre-training Data Quantity Dominates Over Diversity for Fine-tuning Robustness', '3c3', '< Abstract: Pre-training has been widely adopted in deep learning to improve model performance, especially when the training data for a target task is limited. In our work, we seek to understand the implications of this training strategy on the generalization properties of downstream models. More specifically, we ask the following question: how do properties of the pre-training distribution affect the robustness of a fine-tuned model? The properties we explore include the label space, label semantics, image diversity, data domains, and data quantity of the pre-training distribution. We find that the primary factor influencing downstream effective robustness [44] is data quantity, while other factors have limited significance. For example, reducing the number of ImageNet pre-training classes by 4× while increasing the number of images per class by 4× (that is, keeping total data quantity fixed) does not impact the robustness of fine-tuned models. We demonstrate our findings on pre-training distributions drawn from various natural and synthetic data sources, primarily using the iWildCam-WILDS distribution shift as a test for robustness.', '---', '> Abstract: Pre-training is a cornerstone of deep learning, significantly enhancing model performance, particularly in data-scarce scenarios. While its benefits are clear, the precise influence of pre-training data characteristics on the generalization and robustness of fine-tuned models remains an open question. This work investigates how various properties of the pre-training distribution impact the effective robustness of downstream models against natural distribution shifts. We systematically explore the effects of label space, label semantics, image diversity, data domains, and data quantity.
Our key finding is that pre-training data quantity is the primary determinant of downstream effective robustness [44], with other factors exhibiting limited significance. For instance, maintaining a fixed total data quantity, a four-fold reduction in ImageNet pre-training classes coupled with a four-fold increase in images per class does not compromise the robustness of fine-tuned models. We validate these findings across pre-training distributions derived from diverse natural and synthetic data sources, primarily employing the iWildCam-WILDS distribution shift as a robust evaluation benchmark.', '6,16c6,17', '< Transfer learning is a popular technique to deal with data scarcity, improve training speed, or transfer useful inductive biases that can benefit downstream tasks [28,8,10]. In the domain of computer vision, pre-training on ImageNet in particular has been the de-facto standard for obtaining features to solve a wide range of vision tasks, such as object detection [34,9,18], segmentation [5,16], and action recognition [42]. While there exists previous work that seeks to pinpoint specific properties of ImageNet-trained features that benefit downstream performance [19,24,23,38], the analysis is often done with respect to model accuracy. Our work instead examines the robustness of fine-tuned models to natural distribution shifts. Instead of looking at architecture variations and pre-training algorithm as done in prior work [38,14,49], we focus on the role of the pre-training data. 
This data-centric approach has been validated by past work [27,29,13], which shows that the training data distribution plays a larger role than training methods or architecture in influencing model robustness.', '< Robustness under distribution shifts is a fundamental concern for producing reliable machine learning systems: a model can perform in unexpected and undesirable ways when there is a mismatch between the data distribution encountered in deployment and the one on which the model is trained [36,22]. For example, a self-driving car should be able to generalize to a wide variety of weather scenarios to be considered safe, some of which it may not have seen during training. In our work, we focus on these forms of natural distribution shifts [33,22] (named so because they are induced by real-world processes) and study what aspects of the source dataset could help fine-tuned models become more robust to these shifts. We tackle this question along five different ablation axes: (i) Data quantity, (ii) Label granularity, (iii) Label semantics, (iv) Image diversity, and (v) Data sources. Through a better understanding of the interplay between various properties of the pre-training distribution and downstream robustness, we seek to establish guidelines for constructing better pre-training datasets for fine-tuning.', '< Figure 1: A summary of our experimental pipeline. We pre-train a model on a variety of different data distributions and evaluate its effective robustness after fine-tuning on a downstream task (i.e., iWildCam). By examining many models in this manner, we can determine empirical properties of the pre-training distribution that are important for fine-tuning robustness.', '< Previous work by Miller et al. [27] experimented with a wide range of natural distribution shifts and found that pre-training on ImageNet yields the biggest improvement in robustness for the iWildCam-WILDS dataset [22,2].
Consequently, we use iWildCam-WILDS as a probe to evaluate how our interventions with the ImageNet pre-training distribution would alter the robustness trend uncovered in this previous work. We also analyze the use of other pre-training data sources that may differ significantly from ImageNet in both semantic content and data collection methodology. Our main findings can be summarized as follows:', '< (i) Data quantity. Pre-training with more data helps boost robustness. However, we do not need a lot of pre-training data to see significant robustness gains: using 25K images subsampled from either ImageNet or iNaturalist, which is 6× smaller than the size of the fine-tuning dataset, already offers noticeable robustness improvements.', '< (ii) Label granularity. Making labels more coarse-grained lowers transfer robustness. The effect is less significant than altering data quantity: extreme reduction in label granularity (e.g., using 5 coarse classes instead of 1000 fine-grained classes) still preserves some of the robustness gains compared to training from scratch.', '< (iii) Label semantics. Given enough data and labels, using more semantically similar classes does not have a notable impact on the robustness of fine-tuned models. In particular, we find that pre-training on the 600 inanimate object categories in ImageNet yields the same effective robustness as pre-training on the 400 animal categories, despite the fact that the downstream task consists of only animal categories.', '< (iv) Image diversity. Given the same pre-training label set and data quantity, increasing per-class diversity (e.g., by including more subclasses) has no effect on transfer robustness. In addition, the trade-off between having more classes and more images per class is not significant if the total number of samples is kept constant.', '< (v) Data sources. We find that natural data sources (i.e., ImageNet, iNaturalist) yield similar downstream robustness when controlling for data quantity.
Pre-training with synthetic fractal data is less effective at the same data quantity regime but still has some robustness gain to offer compared to training from scratch. Synthetic natural-looking data (e.g., generated by Stable Diffusion [35]) can help close this gap between using natural data and synthetic fractal data.', '< Overall, we find that increasing pre-training data quantity and label granularity makes fine-tuned models more robust to distribution shifts. However, not all additional data is equally helpful. For instance, in the context of the iWildCam-WILDS task, pre-training with natural-looking data offers much more robustness than using 10× more synthetic fractal data.', '< Figure 2: Effective robustness is defined as movement towards a classifier which is robust to distribution shift (i.e., line y = x). Using this metric, Miller et al. [27] observe that for the iWildCam-WILDS task, models pre-trained on ImageNet are much more robust than models trained from scratch. We reproduce these two trends and use them as points of reference for our subsequent experiments, in which we modify the pre-training distribution and observe how our interventions alter the robustness trend lines.', '---', '> Transfer learning, particularly pre-training on large datasets like ImageNet, has become a cornerstone in deep learning for alleviating data scarcity, accelerating training, and imparting beneficial inductive biases for downstream tasks [28,8,10]. While much prior work has investigated the properties of pre-trained features that enhance downstream *accuracy* [19,24,23,38], the impact on *robustness* to natural distribution shifts remains less explored. This paper addresses this critical gap, shifting focus from architectural variations and pre-training algorithms [38,14,49] to the fundamental role of the pre-training data itself.
A data-centric perspective is crucial, as empirical evidence suggests that the training data distribution often exerts a more profound influence on model robustness than algorithmic or architectural choices [27,29,13].', '> ', '> The ability of machine learning systems to maintain performance under real-world distribution shifts is paramount for their reliability and safe deployment. Models often exhibit unexpected and undesirable behaviors when the data encountered during inference deviates from the training distribution [36,22]. Consider autonomous vehicles, which must generalize across diverse, unseen weather conditions to ensure safety. Our research specifically addresses these "natural distribution shifts" [33,22] (those arising from real-world processes) by investigating how properties of the source dataset contribute to the robustness of fine-tuned models against such shifts. We systematically dissect this problem along five distinct ablation axes: (i) Data quantity, (ii) Label granularity, (iii) Label semantics, (iv) Image diversity, and (v) Data sources. As depicted in Figure 1, our experimental pipeline involves pre-training models on a variety of data distributions and subsequently evaluating their effective robustness after fine-tuning on the challenging iWildCam-WILDS dataset. This rigorous, data-centric methodology enables us to identify critical empirical properties of pre-training distributions that are essential for achieving robust fine-tuning and to formulate concrete guidelines for constructing superior pre-training datasets.', '> ', '> Previous work by Miller et al. [27] examined a wide range of natural distribution shifts and found that pre-training on ImageNet yields the largest robustness improvement on the iWildCam-WILDS dataset [22,2]. We therefore use iWildCam-WILDS as a primary benchmark to evaluate how our interventions with pre-training distributions alter this established robustness trend.
We also extend our analysis to other pre-training data sources that diverge significantly from ImageNet in semantic content and data collection methodology. Our main findings are summarized as follows:', '> (i) Data quantity. Increasing pre-training data quantity consistently enhances robustness. Notably, substantial robustness gains are observed even with relatively small pre-training datasets; for instance, 25K images subsampled from ImageNet or iNaturalist (6× smaller than the fine-tuning dataset) already yield noticeable improvements.', '> (ii) Label granularity. Coarse-grained labels during pre-training tend to reduce transfer robustness. However, this effect is less pronounced than that of data quantity. Even extreme reductions in label granularity (e.g., using 5 coarse classes instead of 1000 fine-grained classes) still preserve some of the robustness gains compared to training from scratch.', '> (iii) Label semantics. Provided sufficient data and labels, the semantic similarity between pre-training and downstream task classes has a limited impact on fine-tuned model robustness. Specifically, pre-training on 600 inanimate object categories from ImageNet yields comparable effective robustness to pre-training on 400 animal categories, despite iWildCam exclusively featuring animal categories.', '> (iv) Image diversity. When total data quantity and label sets are fixed, increasing per-class diversity (e.g., by including more subclasses) does not significantly affect transfer robustness. Furthermore, the trade-off between having more classes versus more images per class is negligible if the total number of samples remains constant.', '> (v) Data sources. Natural data sources (ImageNet, iNaturalist) yield similar downstream robustness when data quantity is controlled. Synthetic fractal data is less effective at comparable data quantities but still offers robustness gains over training from scratch.
Critically, synthetic natural-looking data (e.g., generated by Stable Diffusion [35]) can help close the robustness gap between natural and synthetic fractal data.', '> In summary, while greater pre-training data quantity and finer label granularity are key drivers of robustness to distribution shifts, not all additional data is equally beneficial. For instance, in the iWildCam-WILDS task, pre-training with natural-looking data provides significantly more robustness than using tenfold more synthetic fractal data.', '> Figure 2: Effective robustness, defined as the movement towards an ideal robust classifier (i.e., the y = x line), is a key metric. Miller et al. [27] observed that for the iWildCam-WILDS task, ImageNet-pretrained models are considerably more robust than models trained from scratch. We reproduce these two distinct linear trends as baselines for our experiments, observing how our interventions in the pre-training distribution alter these robustness trends.', '167d167', '< ']
