Abstract: State-of-the-art results on neural machine translation often use attentional sequence-to-sequence models with some form of convolution or recurrence. Vaswani et al. (2017) propose a new architecture that avoids recurrence and convolution completely. Instead, it uses only self-attention and feed-forward layers. While the proposed architecture achieves state-of-the-art results on several machine translation tasks, it requires a large number of parameters and training iterations to converge. We propose the Weighted Transformer, a Transformer with modified attention layers that not only outperforms the baseline network in BLEU score but also converges 15-40% faster. Specifically, we replace the multi-head attention with multiple self-attention branches that the model learns to combine during the training process. Our model improves the state-of-the-art performance by 0.5 BLEU points on the WMT 2014 English-to-German translation task and by 0.4 on the English-to-French translation task.
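To make the abstract's core idea concrete, below is a minimal sketch of an attention layer split into multiple self-attention branches whose outputs are mixed by learned, softmax-normalized weights. This is not the authors' code; the class and parameter names (`BranchedSelfAttention`, `n_branches`, `branch_logits`) are illustrative assumptions, and the exact placement of the learned weights in the paper's architecture may differ.

```python
# Sketch only: branched self-attention with learned combination weights,
# assuming each branch has its own output projection and the branch weights
# are normalized with a softmax so they sum to one.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BranchedSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_branches: int = 8):
        super().__init__()
        assert d_model % n_branches == 0
        self.n_branches = n_branches
        self.d_head = d_model // n_branches
        # Shared query/key/value projection, split per branch below.
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # One output projection per branch.
        self.out = nn.ModuleList(
            [nn.Linear(self.d_head, d_model) for _ in range(n_branches)]
        )
        # Learned logits for the branch combination weights.
        self.branch_logits = nn.Parameter(torch.zeros(n_branches))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z):
            # (batch, seq_len, d_model) -> (batch, n_branches, seq_len, d_head)
            return z.view(b, t, self.n_branches, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v  # (batch, n_branches, seq_len, d_head)

        # Learned, normalized combination weights over branches.
        weights = F.softmax(self.branch_logits, dim=0)
        mixed = sum(
            w * proj(heads[:, i])
            for i, (w, proj) in enumerate(zip(weights, self.out))
        )
        return mixed  # (batch, seq_len, d_model)
```

In this sketch the branch weights are trained jointly with the rest of the network, so the model learns how much each branch contributes, which is the mechanism the abstract credits for faster convergence.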
TL;DR: Using branched attention with learned combination weights outperforms the baseline Transformer for machine translation tasks.
Keywords: transformer, branching, attention, machine translation
Code: [Papers with Code: 5 community implementations](https://paperswithcode.com/paper/?openreview=SkYMnLxRW)
Data: [WMT 2014](https://paperswithcode.com/dataset/wmt-2014)
Community Implementations: [CatalyzeX: 5 code implementations](https://www.catalyzex.com/paper/weighted-transformer-network-for-machine/code)
11 Replies