Towards Full Utilization on Mask Task for Distilling PLMs into NMT

Anonymous

17 Sept 2021 (modified: 05 May 2023) · ACL ARR 2021 September Blind Submission
Abstract: Owing to their strong performance on many natural language processing tasks, pre-trained language models (PLMs) have attracted wide attention for application to neural machine translation (NMT). Knowledge distillation (KD) is one of the mainstream methods, as it can bring considerable gains to NMT models without extra computational cost. However, previous KD methods in NMT distill knowledge only at the hidden-state level and cannot make full use of the teacher models. To address this issue, we propose a more effective KD method for NMT based on the mask task, which comprises encoder input conversion, mask task distillation, and a gradient optimization mechanism. We evaluate our translation systems on English→German and Chinese→English tasks, and our methods clearly outperform baseline methods. Moreover, our framework achieves strong performance with different PLMs.
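To illustrate the general idea of mask task distillation described in the abstract, the sketch below shows one common way such an objective could be formed: matching the student encoder's predictions for masked tokens to the teacher PLM's predictions via a temperature-scaled KL divergence. This is a minimal, hypothetical sketch and not the paper's actual implementation; the function name `mask_distillation_loss`, the `temperature` parameter, and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def mask_distillation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           mask_positions: torch.Tensor,
                           temperature: float = 2.0) -> torch.Tensor:
    """Hypothetical mask-task distillation objective (not the paper's exact method).

    student_logits, teacher_logits: (batch, seq_len, vocab_size) vocabulary logits
    mask_positions: boolean tensor (batch, seq_len), True where the input token was masked
    """
    # Keep only the masked positions; both become (num_masked, vocab_size).
    s = student_logits[mask_positions]
    t = teacher_logits[mask_positions]

    # Soften both distributions with a temperature, then match them with KL divergence.
    s_log_prob = F.log_softmax(s / temperature, dim=-1)
    t_prob = F.softmax(t / temperature, dim=-1)

    # Scale by T^2 as is standard in distillation so gradients stay comparable.
    return F.kl_div(s_log_prob, t_prob, reduction="batchmean") * temperature ** 2
```

In this sketch the loss would be added to the usual NMT training objective, with the teacher PLM frozen and only the masked source positions contributing to the distillation term.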