Deep-to-bottom Weights Decay: A Systemic Knowledge Review Learning Technique for Transformer Layers in Knowledge Distillation

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: The outstanding performance of pre-trained language models on natural language processing tasks comes at the cost of millions of parameters and heavy computational consumption. Knowledge distillation is a widely used compression strategy to address this problem. However, previous works either (i) distill only a subset of the teacher model's transformer layers, ignoring the importance of the base information carried by the bottom layers, or (ii) neglect that knowledge differs in difficulty from deep to shallow layers, corresponding to different levels of information in the teacher model. We introduce a deep-to-bottom weights decay review mechanism for knowledge distillation, which fuses teacher-side information while taking each layer's difficulty level into consideration. To validate our claims, we distill a 12-layer BERT into a 6-layer model and evaluate it on the GLUE benchmark. Experimental results show that our review approach outperforms existing techniques.
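As a rough illustration of the mechanism sketched in the abstract (the paper's exact formulation is not given on this page), the following PyTorch-style snippet fuses teacher hidden states with weights that decay from the deepest layer toward the bottom layer and uses the fused representation as a distillation target. The function names, the exponential decay factor, the MSE matching, and the shared hidden size are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def deep_to_bottom_weights(num_layers: int, decay: float = 0.8) -> torch.Tensor:
    """Illustrative: per-layer weights that decay from the deepest (top)
    teacher layer down to the bottom layer, normalized to sum to 1."""
    # Layer index 0 = bottom, num_layers-1 = top; the top layer gets weight 1
    # before normalization, lower layers get progressively smaller weights.
    w = torch.tensor([decay ** (num_layers - 1 - i) for i in range(num_layers)])
    return w / w.sum()

def review_distill_loss(student_hidden, teacher_hidden, decay: float = 0.8):
    """Hypothetical review-style loss: every student layer is matched against
    a weighted fusion of all teacher layers (deep-to-bottom decaying weights).
    Both lists hold [batch, seq, dim] tensors with the same hidden size."""
    w = deep_to_bottom_weights(len(teacher_hidden), decay).to(student_hidden[0].device)
    # Fuse the teacher layers into a single reviewed representation.
    fused_teacher = sum(wi * h for wi, h in zip(w, teacher_hidden))
    loss = 0.0
    for s in student_hidden:
        loss = loss + F.mse_loss(s, fused_teacher)
    return loss / len(student_hidden)

# Example shapes matching the abstract's setting: 12-layer teacher,
# 6-layer student, hidden size 768 (typical BERT-base configuration).
teacher_hidden = [torch.randn(2, 16, 768) for _ in range(12)]
student_hidden = [torch.randn(2, 16, 768) for _ in range(6)]
print(review_distill_loss(student_hidden, teacher_hidden).item())
```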