Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

Iulia Turc; Ming-Wei Chang; Kenton Lee; Kristina Toutanova

25 Sept 2019 (modified: 22 Jun 2025); ICLR 2020 Conference Blind Submission; Readers: Everyone

Keywords: NLP, self-supervised learning, language model pre-training, knowledge distillation, BERT, compact models

TL;DR: Studies how self-supervised learning and knowledge distillation interact in the context of building compact models.

Abstract: Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to downstream tasks, several model compression techniques on pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). However, surprisingly, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and that fine-tuning pre-trained compact models can be competitive with more elaborate methods proposed in concurrent work. Starting with pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting simple, yet effective and general algorithm, Pre-trained Distillation, brings further improvements. Through extensive experiments, we more generally explore the interaction between pre-training and distillation under two variables that have been under-studied: model size and properties of unlabeled task data. One surprising observation is that pre-training and distillation have a compound effect even when sequentially applied on the same data. To accelerate future research, we will make our 24 pre-trained miniature BERT models publicly available.
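
The abstract refers to "standard knowledge distillation" without spelling out the loss. As a minimal sketch, assuming the common formulation in which the compact student is trained to match the teacher's predicted class distribution through a soft-label cross-entropy, the objective could look as follows. The function names `log_softmax` and `distillation_loss` and the example logits are hypothetical and used only for illustration; they are not taken from the paper or from the google-research/bert repository.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_total = math.log(sum(math.exp(x - m) for x in logits))
    return [x - m - log_total for x in logits]


def distillation_loss(teacher_logits, student_logits):
    """Soft-label cross-entropy: -sum_c p_teacher(c) * log p_student(c).

    Assumed formulation of "standard knowledge distillation";
    the abstract itself does not state the exact loss.
    """
    p_teacher = [math.exp(lp) for lp in log_softmax(teacher_logits)]
    log_p_student = log_softmax(student_logits)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))


# Hypothetical logits for a two-class task.
teacher = [2.0, 0.0]
agreeing_student = [2.0, 0.0]
disagreeing_student = [0.0, 2.0]

# A student that agrees with the teacher incurs a lower loss.
loss_agree = distillation_loss(teacher, agreeing_student)
loss_disagree = distillation_loss(teacher, disagreeing_student)
```

Under this assumption, the loss is smallest when the student reproduces the teacher's distribution, at which point it equals the teacher's entropy. In the pipeline the abstract describes, the student would already be pre-trained before this objective is applied.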

Code: [google-research/bert](https://github.com/google-research/bert) + [39 community implementations](https://paperswithcode.com/paper/?openreview=BJg7x1HFvB)

Data: [BookCorpus](https://paperswithcode.com/dataset/bookcorpus), [GLUE](https://paperswithcode.com/dataset/glue), [MultiNLI](https://paperswithcode.com/dataset/multinli), [SNLI](https://paperswithcode.com/dataset/snli), [SST](https://paperswithcode.com/dataset/sst), [SST-2](https://paperswithcode.com/dataset/sst-2)

Community Implementations: [39 code implementations](https://www.catalyzex.com/paper/well-read-students-learn-better-on-the/code)
