Keywords: Transfer Learning, Infinite Width, Kernel Methods
Abstract: We develop a theory of transfer learning in infinitely wide neural networks in which both the pretraining (source) and downstream (target) tasks can operate in a feature learning regime. We analyze both the Bayesian framework, where learning is described by a posterior distribution, and gradient flow training of randomly initialized networks with weight decay. Both settings track how representations evolve across the source and target phases. The summary statistics of these theories are adapted feature kernels which, after transfer learning, depend on data and labels from both the source and target tasks. Feature reuse during transfer is governed by an elastic weight coupling that controls how strongly the network relies on features learned during training on the source task. We apply our theory to linear and polynomial regression tasks as well as to real datasets. Our theory and experiments reveal interesting interplays among elastic weight coupling, feature learning strength, dataset size, and source-target task alignment in determining the utility of transfer learning.
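To make the elastic weight coupling idea concrete, below is a minimal illustrative sketch (not the paper's code) of transfer learning on a toy linear regression task: the target-phase weights are trained with a quadratic coupling pulling them back toward the source-pretrained weights. The data-generating setup, the coupling strength `gamma`, and all other choices here are assumptions made purely for illustration.

```python
# Hypothetical toy sketch: transfer learning with an "elastic" coupling that
# penalizes deviation of the target-phase weights from the source-pretrained
# weights. All settings below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n_src, n_tgt = 20, 200, 30

# Source and target tasks share a partially aligned ground-truth direction.
w_src_true = rng.normal(size=d)
w_tgt_true = 0.8 * w_src_true + 0.2 * rng.normal(size=d)

X_src = rng.normal(size=(n_src, d)); y_src = X_src @ w_src_true
X_tgt = rng.normal(size=(n_tgt, d)); y_tgt = X_tgt @ w_tgt_true

def train(X, y, w_init, w_anchor, gamma, lr=1e-2, steps=5000):
    """Gradient descent on squared loss plus an elastic coupling
    gamma * ||w - w_anchor||^2 pulling the weights toward w_anchor."""
    w = w_init.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y) + gamma * (w - w_anchor)
        w -= lr * grad
    return w

# Source phase: ordinary weight decay (anchor at zero).
w_source = train(X_src, y_src, np.zeros(d), np.zeros(d), gamma=1e-3)

# Target phase: vary the elastic coupling back to the source weights.
for gamma in [0.0, 0.1, 1.0, 10.0]:
    w_transfer = train(X_tgt, y_tgt, w_source, w_source, gamma=gamma)
    mse = np.mean((X_tgt @ w_transfer - y_tgt) ** 2)   # target training error
    dist = np.linalg.norm(w_transfer - w_tgt_true)     # distance to target truth
    print(f"gamma={gamma:5.1f}  train MSE={mse:.4f}  ||w - w*_tgt||={dist:.3f}")
```

In this sketch, larger `gamma` makes the target solution lean more heavily on the source-learned weights, which helps when the tasks are well aligned and the target dataset is small, mirroring the qualitative trade-off described in the abstract.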
Primary Area: learning theory
Submission Number: 20529