- TL;DR: We develop a task-agnosticlly compressed BERT, which is 4.3x smaller and 4.0x faster than BERT-BASE while achieving competitive performance on GLUE and SQuAD.
- Abstract: The recent development of Natural Language Processing (NLP) has achieved great success using large pre-trained models with hundreds of millions of parameters. However, these models suffer from the heavy model size and high latency such that we cannot directly deploy them to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating the popular BERT model. Like BERT, MobileBERT is task-agnostic; that is, it can be universally applied to various downstream NLP tasks via fine-tuning. MobileBERT is a slimmed version of BERT-LARGE augmented with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. To train MobileBERT, we use a bottom-to-top progressive scheme to transfer the intrinsic knowledge of a specially designed Inverted Bottleneck BERT-LARGE teacher to it. Empirical studies show that MobileBERT is 4.3x smaller and 4.0x faster than original BERT-BASE while achieving competitive results on well-known NLP benchmarks. On the natural language inference tasks of GLUE, MobileBERT achieves 0.6 GLUE score performance degradation, and 367 ms latency on a Pixel 3 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a 90.0/79.2 dev F1 score, which is 1.5/2.1 higher than BERT-BASE.
- Keywords: BERT, knowledge transfer, model compression