Abstract: Transfer learning leverages large-scale pretraining to adapt models to specific downstream
tasks, and it has become a powerful, widely adopted training strategy in deep learning. What, then, makes it effective? Prior research has attributed its success to feature
reuse, the reuse of pretrained weights, domain alignment, and the transfer of low-level data statistics.
This study goes beyond these perspectives and focuses on a more fundamental factor:
the evolution of the logit distribution within the latent feature space of pretrained models. We
introduce a novel approach using the Wasserstein distance to track distributional changes in
the latent features. We find that pretraining not only learns the input distributions but also
transforms them into generalizable internal representations in a consistent manner across
all frozen layers. This finding underpins the effectiveness of transfer learning and provides
a unifying explanation for these established theoretical perspectives.
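To make the measurement concrete, the sketch below illustrates one simple way to track distributional change in latent features with the Wasserstein distance: averaging the 1-D Wasserstein-1 distance over feature dimensions of activations drawn from the same frozen layer on two input distributions. This is a minimal illustration under assumed simplifications (per-dimension averaging, synthetic stand-in activations, hypothetical names such as `layerwise_wasserstein`), not the paper's exact formulation.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def layerwise_wasserstein(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Average 1-D Wasserstein-1 distance across feature dimensions.

    feats_a, feats_b: (num_samples, num_features) activations taken from the
    same frozen layer on two different input distributions (e.g., pretraining
    domain vs. downstream domain). Per-dimension averaging is an assumed
    simplification for illustration only.
    """
    assert feats_a.shape[1] == feats_b.shape[1], "feature dimensions must match"
    return float(np.mean([
        wasserstein_distance(feats_a[:, d], feats_b[:, d])
        for d in range(feats_a.shape[1])
    ]))


# Example with synthetic stand-in activations: a shifted, rescaled distribution
# yields a larger distance, mimicking a layer whose feature distribution drifts.
rng = np.random.default_rng(0)
feats_source = rng.normal(0.0, 1.0, size=(512, 64))
feats_target = rng.normal(0.5, 1.2, size=(512, 64))
print(f"mean per-dimension W1: {layerwise_wasserstein(feats_source, feats_target):.3f}")
```

In practice, one would replace the synthetic arrays with activations extracted from each frozen layer of the pretrained network and plot the resulting distances layer by layer to see how consistently the representation evolves.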
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Zhihui_Zhu1
Submission Number: 5506