Knowledge Distillation based Ensemble Learning for Neural Machine Translation

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Withdrawn Submission
Keywords: Knowledge Distillation, Ensemble Learning, Neural Machine Translation
Abstract: Model ensembling can effectively improve the accuracy of neural machine translation, but it comes at the cost of large computation and memory requirements. Moreover, model ensembling cannot combine the strengths of translation models with different decoding strategies, since their translation probabilities cannot be directly aggregated. In this paper, we introduce an ensemble learning framework based on knowledge distillation that aggregates the knowledge of multiple teacher models into a single student model. Under this framework, we introduce word-level ensemble learning and sequence-level ensemble learning for neural machine translation, where sequence-level ensemble learning is capable of aggregating translation models with different decoding strategies. Experimental results on multiple translation tasks show that, by combining the two ensemble learning methods, our approach achieves substantial improvements over competitive baseline systems and establishes a new single-model state-of-the-art BLEU score of 31.13 on the WMT14 English-German translation task. (We will release the source code and the created SEL training data for reproducibility.)
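
A minimal sketch of what the word-level ensemble learning (WEL) objective described in the abstract could look like, assuming PyTorch and decoder output logits over a shared vocabulary; the function name, tensor shapes, and temperature parameter are illustrative assumptions, not the paper's released implementation. The teachers' word-level distributions are averaged and the student is trained to match that average; sequence-level ensemble learning (SEL) would instead train the student on translations generated by the teachers, which is not shown here.

# Hypothetical sketch of word-level ensemble knowledge distillation (WEL);
# model architecture, loss weighting, and data pipeline in the paper may differ.
import torch
import torch.nn.functional as F

def wel_distillation_loss(student_logits, teacher_logits_list, temperature=1.0):
    """KL divergence between the averaged teacher distribution and the student.

    student_logits:      (batch, seq_len, vocab) logits from the single student.
    teacher_logits_list: list of (batch, seq_len, vocab) logits, one per teacher.
    """
    # Average the teachers' word-level probability distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch dimension.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Toy usage with random logits (2 teachers, batch=2, seq_len=3, vocab=10).
if __name__ == "__main__":
    torch.manual_seed(0)
    student = torch.randn(2, 3, 10, requires_grad=True)
    teachers = [torch.randn(2, 3, 10) for _ in range(2)]
    loss = wel_distillation_loss(student, teachers)
    loss.backward()
    print(f"WEL distillation loss: {loss.item():.4f}")
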
One-sentence Summary: We propose an ensemble learning method for NMT to aggregate the knowledge of multiple models into a single model.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): /references/pdf?id=FiOiVOa-4Y