Unveiling Transfer Learning Effectiveness Through Latent Feature Distributions

TMLR Paper5506 Authors

30 Jul 2025 (modified: 08 Aug 2025) · Under review for TMLR · CC BY 4.0
Abstract: Transfer learning leverages large-scale pretraining to adapt models to specific downstream tasks, and it has emerged as a powerful and widely adopted training strategy in deep learning. So what makes it effective? Prior research has attributed its success to feature reuse, reuse of pretrained weights, domain alignment, and the transfer of low-level data statistics. This study goes beyond these perspectives and focuses on a more fundamental factor: the evolution of logit distributions within the latent feature space of pretrained models. We introduce a novel approach that uses the Wasserstein distance to track distributional changes in the latent features. We find that pretraining not only learns the input distributions but also transforms them into generalizable internal representations in a consistent manner across all frozen layers. This finding underpins the effectiveness of transfer learning and provides a unifying explanation for these established theoretical perspectives.
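
As a rough illustration of the kind of measurement described in the abstract (not the paper's actual procedure), one could compare per-channel latent feature distributions of a pretrained network against a randomly initialized copy using the 1-D Wasserstein distance from SciPy. The model (torchvision ResNet-18), the hooked layer, and the random input batch below are illustrative placeholders.

```python
# Hedged sketch: compare latent feature distributions of a pretrained model
# and a randomly initialized one via the 1-D Wasserstein distance.
# Model, layer, and input data are placeholders, not the paper's setup.
import numpy as np
import torch
from scipy.stats import wasserstein_distance
from torchvision.models import resnet18, ResNet18_Weights


def layer_features(model, layer, x):
    """Collect channel-wise activations of `layer` (spatially averaged) for batch `x`."""
    feats = []
    handle = layer.register_forward_hook(
        lambda m, inp, out: feats.append(out.detach().mean(dim=(2, 3)))
    )
    with torch.no_grad():
        model(x)
    handle.remove()
    return feats[0].cpu().numpy()  # shape: (batch, channels)


# Pretrained vs. randomly initialized copies of the same architecture.
pretrained = resnet18(weights=ResNet18_Weights.DEFAULT).eval()
scratch = resnet18(weights=None).eval()

x = torch.randn(64, 3, 224, 224)  # placeholder input batch
f_pre = layer_features(pretrained, pretrained.layer3, x)
f_scr = layer_features(scratch, scratch.layer3, x)

# Average 1-D Wasserstein distance across channels as a rough summary of how
# differently the two networks distribute features at this layer.
dists = [wasserstein_distance(f_pre[:, j], f_scr[:, j]) for j in range(f_pre.shape[1])]
print(f"mean per-channel W1 distance at layer3: {np.mean(dists):.4f}")
```

Repeating such a comparison layer by layer (or before and after fine-tuning) gives one way to track how the distribution of internal representations evolves, in the spirit of the analysis the abstract describes.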
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Zhihui_Zhu1
Submission Number: 5506