Proactive Sequence Generator via Knowledge Acquisition

25 Sep 2019 (modified: 24 Dec 2019) | ICLR 2020 Conference Blind Submission | Readers: Everyone
  • Keywords: neural machine translation, knowledge distillation, exposure bias, reinforcement learning
  • TL;DR: We develop a knowledge acquisition framework that transfers knowledge from larger sequence models to small models and helps alleviate exposure bias. We observe gains of +0.7 to +1.1 BLEU on benchmark datasets.
  • Abstract: Sequence-to-sequence models such as transformers, which are now used in a wide variety of NLP tasks, typically need very high capacity to perform well. Unfortunately, in production, memory size and inference speed are both strictly constrained. To address this problem, Knowledge Distillation (KD), a technique for training small models to mimic larger pre-trained models, has attracted considerable attention. The KD approach essentially attempts to maximize recall, i.e., to rank the teacher model's top-k tokens as high as possible, whereas precision is more important for sequence generation because of exposure bias. Motivated by this, we develop Knowledge Acquisition (KA), in which student models receive log q(y_t | y_{<t}, x) as rewards when producing the next token y_t given the previous tokens y_{<t} and the source sentence x. We demonstrate the effectiveness of our approach on the WMT’17 De-En and IWSLT’15 Th-En translation tasks, where it gains +0.7 to +1.1 BLEU over token-level knowledge distillation.
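
The recall-versus-precision contrast in the abstract can be illustrated with a small numerical sketch. It is not the authors' implementation. It assumes that q is the teacher's next-token distribution, that p is the student's, that token-level KD minimizes the cross-entropy -Σ q(v) log p(v), and that the KA reward is taken in expectation under the student, Σ p(v) log q(v). The function names and the toy distributions are hypothetical.

```python
import math

def kd_loss(student_probs, teacher_probs):
    """Token-level KD loss (assumed form): -sum_v q(v) * log p(v).

    Weighted by the teacher q, so the student is penalized for giving
    low probability to any token the teacher likes (recall-oriented).
    """
    return -sum(q * math.log(p)
                for p, q in zip(student_probs, teacher_probs) if q > 0)

def expected_ka_reward(student_probs, teacher_probs):
    """Expected KA-style reward (assumed form): sum_v p(v) * log q(v).

    Weighted by the student p, so the student is penalized for putting
    mass on tokens the teacher considers unlikely (precision-oriented).
    """
    return sum(p * math.log(q)
               for p, q in zip(student_probs, teacher_probs) if p > 0)

# Hypothetical next-token distributions over a 4-token vocabulary.
teacher = [0.70, 0.20, 0.05, 0.05]
student_spread = [0.40, 0.30, 0.15, 0.15]  # covers the teacher's tokens broadly
student_peaked = [0.97, 0.01, 0.01, 0.01]  # commits to the teacher's top token

for name, student in [("spread", student_spread), ("peaked", student_peaked)]:
    print(name,
          "KD loss:", round(kd_loss(student, teacher), 3),
          "expected reward:", round(expected_ka_reward(student, teacher), 3))
```

Under these assumptions the spread student has the lower KD loss, while the peaked student earns the higher expected reward, which matches the abstract's claim that KD favors recall whereas a reward of log q favors precision.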