Paper Link: https://openreview.net/forum?id=-ZsTBN3IAwW
Paper Type: Long paper (up to eight pages of content + unlimited references and appendices)
Abstract: $k$-nearest-neighbor machine translation ($k$NN-MT), proposed by Khandelwal et al. (2021), has achieved many state-of-the-art results in machine translation tasks. Although effective, $k$NN-MT requires conducting $k$NN searches over a large datastore at each decoding step during inference, which prohibitively increases the decoding cost and makes deployment in real-world applications difficult. In this paper, we propose to move the time-consuming $k$NN search forward to the preprocessing phase, and then introduce $k$-Nearest-Neighbor Knowledge Distillation ($k$NN-KD), which trains the base NMT model to directly learn the knowledge retrieved by $k$NN. Distilling the knowledge retrieved by $k$NN encourages the NMT model to take more reasonable target tokens into consideration, thus addressing the overcorrection problem. Extensive experimental results show that the proposed method achieves consistent improvements over state-of-the-art baselines including $k$NN-MT, while maintaining the same training and decoding speed as the standard NMT model.
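Illustrative sketch: the abstract describes distilling a precomputed $k$NN distribution into the base NMT model during training. Below is a minimal PyTorch-style sketch of what such a distillation objective might look like, assuming the $k$NN target distributions have already been built offline in the preprocessing phase; the function name `knn_kd_loss`, the `alpha` interpolation weight, and the masking details are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def knn_kd_loss(logits, gold, knn_probs, alpha=0.5, pad_id=0):
    """Sketch of a kNN knowledge-distillation objective (assumed form).

    logits:    (batch, tgt_len, vocab) raw decoder outputs
    gold:      (batch, tgt_len)        reference token ids
    knn_probs: (batch, tgt_len, vocab) kNN distribution precomputed offline
    alpha:     interpolation weight between NLL and distillation (assumption)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Standard cross-entropy against the reference translation.
    nll = F.nll_loss(log_probs.transpose(1, 2), gold, ignore_index=pad_id)
    # Distillation term: cross-entropy against the soft kNN "teacher" distribution.
    mask = (gold != pad_id).unsqueeze(-1).float()
    kd = -(knn_probs * log_probs * mask).sum() / mask.sum()
    return alpha * nll + (1.0 - alpha) * kd
```

Because the $k$NN distributions are produced before training, no datastore search is needed at decoding time, which is consistent with the abstract's claim of matching the standard NMT model's training and decoding speed.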
Presentation Mode: This paper will be presented virtually
Virtual Presentation Timezone: UTC+8
Copyright Consent Signature (type Name Or NA If Not Transferrable): Zhixian Yang
Copyright Consent Name And Address: Peking University; Yiheyuan Road No. 5, Haidian District, Peking University, Beijing, China