Abstract: Neural Machine Translation (NMT) imposes a heavy burden on computation and memory, making it challenging to deploy NMT models on devices with limited computation and memory budgets. This paper presents a four-stage pipeline to compress the model and speed up decoding for NMT. Our method first introduces a compact architecture based on a convolutional encoder and weight-shared embeddings. Weight pruning is then applied to obtain a sparse model. Next, we propose a fast sequence interpolation approach that enables greedy decoding to achieve performance on par with beam search, so the time-consuming beam search can be replaced by simple greedy decoding. Finally, vocabulary selection is used to reduce the computation of the softmax layer. Our final model achieves a 10× speedup, a 17× reduction in parameters, and a 35 MB storage size, with performance comparable to the baseline model.
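
To make the pruning stage concrete, here is a minimal magnitude-based weight-pruning sketch in Python/NumPy. It is an illustration of the general technique, not the paper's implementation; the `sparsity` knob and the global-threshold rule are assumptions.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries so that a `sparsity`
    fraction of the weights becomes zero (assumed pruning criterion)."""
    k = int(weights.size * sparsity)            # number of weights to remove
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value serves as the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0   # drop weights at/below threshold
    return pruned

# Example: prune 80% of a random weight matrix
W = np.random.randn(512, 512)
W_sparse = magnitude_prune(W, sparsity=0.8)
print(f"achieved sparsity: {np.mean(W_sparse == 0):.2f}")
```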
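
The decoding stage the pipeline falls back to is plain greedy search (beam size 1). The sketch below shows only that greedy loop, not the paper's sequence interpolation training; `step_fn` is a hypothetical interface returning next-token logits for the current prefix.

```python
import numpy as np

def greedy_decode(step_fn, bos_id: int, eos_id: int, max_len: int = 100) -> list[int]:
    """Greedy decoding: pick the argmax token at each step (beam size 1)."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = step_fn(tokens)            # (V,) scores for the next token
        next_id = int(np.argmax(logits))    # single best token, no beam
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Toy usage with a dummy model whose logits always peak at token id 2 (EOS)
dummy = lambda prefix: np.eye(1000)[2]
print(greedy_decode(dummy, bos_id=0, eos_id=2))   # -> [0, 2]
```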
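
Finally, the vocabulary-selection idea of scoring only a small candidate subset of the output vocabulary can be sketched as follows. How candidates are chosen is left out; the random candidate set here is purely illustrative, and `restricted_softmax` is a hypothetical helper, not the paper's code.

```python
import numpy as np

def restricted_softmax(hidden: np.ndarray, out_embed: np.ndarray,
                       candidate_ids: np.ndarray) -> np.ndarray:
    """Softmax over a candidate subset instead of the full vocabulary.

    hidden:        (d,)   decoder state
    out_embed:     (V, d) output embedding matrix (V = full vocab size)
    candidate_ids: indices of the selected candidate words (|C| << V)
    """
    logits = out_embed[candidate_ids] @ hidden   # (|C|,) dot products, not (V,)
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example: score 500 candidates instead of a 50k-word vocabulary
V, d = 50_000, 256
out_embed = np.random.randn(V, d).astype(np.float32)
hidden = np.random.randn(d).astype(np.float32)
candidates = np.random.choice(V, size=500, replace=False)
p = restricted_softmax(hidden, out_embed, candidates)
```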