MobileBERT: Task-Agnostic Compression of BERT by Progressive Knowledge Transfer

Zhiqing Sun; Hongkun Yu; Xiaodan Song; Renjie Liu; Yiming Yang; Denny Zhou

MobileBERT: Task-Agnostic Compression of BERT by Progressive Knowledge Transfer

Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou

25 Sept 2019 (modified: 05 May 2023)ICLR 2020 Conference Withdrawn SubmissionReaders: Everyone

TL;DR: We develop a task-agnosticlly compressed BERT, which is 4.3x smaller and 4.0x faster than BERT-BASE while achieving competitive performance on GLUE and SQuAD.

Abstract: The recent development of Natural Language Processing (NLP) has achieved great success using large pre-trained models with hundreds of millions of parameters. However, these models suffer from the heavy model size and high latency such that we cannot directly deploy them to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating the popular BERT model. Like BERT, MobileBERT is task-agnostic; that is, it can be universally applied to various downstream NLP tasks via fine-tuning. MobileBERT is a slimmed version of BERT-LARGE augmented with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. To train MobileBERT, we use a bottom-to-top progressive scheme to transfer the intrinsic knowledge of a specially designed Inverted Bottleneck BERT-LARGE teacher to it. Empirical studies show that MobileBERT is 4.3x smaller and 4.0x faster than original BERT-BASE while achieving competitive results on well-known NLP benchmarks. On the natural language inference tasks of GLUE, MobileBERT achieves 0.6 GLUE score performance degradation, and 367 ms latency on a Pixel 3 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a 90.0/79.2 dev F1 score, which is 1.5/2.1 higher than BERT-BASE.

Keywords: BERT, knowledge transfer, model compression

Original Pdf: pdf

16 Replies

Loading