Abstract: There is mounting evidence that pretraining can be valuable for neural network language understanding models, but we do not yet have a clear understanding of how the choice of pretraining objective affects the type of linguistic information that models learn. With this in mind, we compare four objectives---language modeling, translation, skip-thought, and autoencoding---on their ability to induce syntactic and part-of-speech information, holding constant the genre and quantity of training data. We find that representations from language models consistently perform best on our syntactic auxiliary prediction tasks, even when trained on relatively small amounts of data, which suggests that language modeling may be the best data-rich pretraining task for transfer learning applications requiring syntactic information. We also find that a randomly-initialized, frozen model can perform strikingly well on our auxiliary tasks, but that this effect disappears when the amount of training data for the auxiliary tasks is reduced.
Keywords: representation learning, recurrent neural networks, syntax, part-of-speech tagging
TL;DR: Representations from language models consistently perform better than translation encoders on syntactic auxiliary prediction tasks.