What is being transferred in transfer learning in NLP?

Nir Yellinek, Raja Giryes

16 Jan 2022 (modified: 21 Jan 2022) | OpenReview Archive Direct Upload | Readers: Everyone

Abstract: Despite the widespread and successful use of transfer learning in NLP applications, we do not yet fully understand what enables a successful transfer and which part of the network is responsible for it. In this paper, we address these fundamental questions for the transfer of word embeddings. We report on a series of experiments with convolutional neural networks (CNNs) trained on top of pre-trained word vectors versus the same CNN architecture trained on top of randomly initialized word vectors for multi-class question classification. Through a series of analyses of word vectors transferred to block-shuffled sentences, we separate the effect of semantic feature reuse from that of learning low-level statistics of the data. We show that some of the benefits of transfer learning come from learning low-level data statistics. We also show that when training from pre-trained word vectors, the models stay in the same basin of the loss landscape, and different instances of such models are more similar in feature space and closer in parameter space than models trained with randomly initialized word vectors.
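
The abstract does not specify how sentences are block-shuffled. As a rough illustration only, one plausible reading is that a tokenized sentence is cut into consecutive fixed-size blocks and the order of the blocks is permuted, which preserves local word statistics while disrupting sentence-level meaning. The function name `block_shuffle`, its `block_size` and `seed` parameters, and the example sentence are all hypothetical and are not taken from the paper.

```python
import random


def block_shuffle(tokens, block_size, seed=None):
    """Cut `tokens` into consecutive blocks of `block_size` and permute the blocks.

    Hypothetical helper: the block definition is an assumption, not the
    paper's stated procedure. Word order inside each block is preserved.
    """
    if block_size < 1:
        raise ValueError("block_size must be >= 1")
    rng = random.Random(seed)  # local RNG so results are reproducible per seed
    blocks = [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
    rng.shuffle(blocks)  # permute block order in place
    return [tok for block in blocks for tok in block]


# Illustrative question-style sentence (made up for this sketch).
sentence = "what is the capital city of France".split()
shuffled = block_shuffle(sentence, block_size=2, seed=0)
print(" ".join(shuffled))
```

Under this reading, a smaller `block_size` destroys more of the sentence structure, and a `block_size` at least as large as the sentence leaves it unchanged.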
