- TL;DR: This paper propose a new model which combines multi scale information for sequence to sequence learning.
- Abstract: Transformers have achieved state-of-the-art results on a variety of natural language processing tasks. Despite good performance, Transformers are still weak in long sentence modeling where the global attention map is too dispersed to capture valuable information. In such case, the local/token features that are also significant to sequence modeling are omitted to some extent. To address this problem, we propose a Multi-scale attention model (MUSE) by concatenating attention networks with convolutional networks and position-wise feed-forward networks to explicitly capture local and token features. Considering the parameter size and computation efficiency, we re-use the feed-forward layer in the original Transformer and adopt a lightweight dynamic convolution as implementation. Experimental results show that the proposed model achieves substantial performance improvements over Transformer, especially on long sentences, and pushes the state-of-the-art from 35.6 to 36.2 on IWSLT 2014 German to English translation task, from 30.6 to 31.3 on IWSLT 2015 English to Vietnamese translation task. We also reach the state-of-art performance on WMT 2014 English to French translation dataset, with a BLEU score of 43.2.
- Keywords: Attention, Sequence to sequence learning, Deep neural networks, Machine Translation, Natural Language Processing