MS-BERT: A Multi-layer Self-distillation Approach for BERT Compression Based on Earth Mover's Distance

Published: 2021 · Last Modified: 21 Jan 2026 · CollaborateCom (2) 2021 · CC BY-SA 4.0
Abstract: In the past three years, pre-trained language models have been widely used in various natural language processing tasks and have achieved significant progress. However, their high computational cost seriously limits their efficiency, which impairs their application in resource-limited industries. To improve model efficiency while preserving accuracy, we propose MS-BERT, a multi-layer self-distillation approach for BERT compression based on Earth Mover's Distance (EMD), which has the following features: (1) MS-BERT allows the lightweight network (student) to learn from all layers of the large model (teacher), so the student can acquire different levels of knowledge from the teacher, which enhances its performance. (2) Earth Mover's Distance (EMD) is introduced to measure the distance between the teacher layers and the student layers, enabling multi-layer knowledge transfer from teacher to student. (3) Two design strategies for the student layers and a top-K uncertainty calculation method are proposed to further improve MS-BERT's performance. Extensive experiments on different datasets show that our model can be 2 to 12 times faster than BERT at different levels of accuracy loss.
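Since the abstract only names EMD-based multi-layer transfer without giving its formulation, the following is a minimal sketch of what such a layer-matching loss could look like: each layer's hidden states are mean-pooled, pairwise teacher-student layer distances form a cost matrix, and an approximate transport plan weights those distances. The Sinkhorn-style solver, uniform layer weights, and pooling used here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): an EMD-style loss between all teacher
# layers and all student layers, using a Sinkhorn approximation of the transport plan.
import torch

def sinkhorn_transport(cost, w_t, w_s, reg=0.1, n_iters=50):
    """Approximate the optimal transport plan for a (T x S) layer cost matrix."""
    K = torch.exp(-cost / reg)                      # kernel derived from layer-to-layer costs
    u = torch.ones_like(w_t)
    for _ in range(n_iters):                        # Sinkhorn scaling iterations
        v = w_s / (K.t() @ u)
        u = w_t / (K @ v)
    return torch.diag(u) @ K @ torch.diag(v)        # transport plan of shape (T, S)

def emd_layer_loss(teacher_hidden, student_hidden):
    """EMD-style distillation loss over mean-pooled hidden states of every layer."""
    # teacher_hidden: list of T tensors [batch, seq, hid]; student_hidden: list of S tensors
    t_feats = torch.stack([h.mean(dim=(0, 1)) for h in teacher_hidden])  # (T, hid)
    s_feats = torch.stack([h.mean(dim=(0, 1)) for h in student_hidden])  # (S, hid)
    cost = torch.cdist(t_feats, s_feats, p=2)                            # pairwise layer distances
    w_t = torch.full((len(teacher_hidden),), 1.0 / len(teacher_hidden))  # assumed uniform layer weights
    w_s = torch.full((len(student_hidden),), 1.0 / len(student_hidden))
    plan = sinkhorn_transport(cost.detach(), w_t, w_s)                   # fix the plan, backprop through the cost
    return (plan * cost).sum()                                           # EMD = sum of plan * cost
```

In this sketch the transport plan decides how much each teacher layer contributes to each student layer, which is one plausible way to realize "learning from all layers of the teacher"; the paper's actual layer weighting (including the top-K uncertainty calculation) may differ.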