Abstract: Multi-head attention plays a crucial role in the recent success of the Transformer, leading to consistent performance improvements over conventional attention in various applications. The popular belief is that its effectiveness stems from attending to information from multiple representation subspaces. In this paper, we first demonstrate that using multiple subspaces is not a unique feature of multi-head attention, as multi-layer single-head attention also leverages multiple subspaces. We then suggest that the main advantage of multi-head attention is training stability, since it requires fewer layers than single-head attention when using the same total number of subspaces. For example, a 24-layer 16-head Transformer (BERT-large) and a 384-layer single-head Transformer have roughly the same model size and the same total number of subspaces (attention heads), yet the multi-head model is significantly shallower. Meanwhile, we show that, with recent advances in deep learning, we can successfully stabilize the training of deep single-head Transformers. As training difficulty is no longer a bottleneck, substantially deeper single-head Transformers achieve consistent performance improvements.
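To make the layer/head arithmetic concrete, here is a minimal back-of-the-envelope sketch (not from the paper) comparing the two configurations mentioned in the abstract. The BERT-large shape (24 layers, 16 heads, hidden size 1024) is standard; the hidden size of 256 for the 384-layer single-head model is an assumed value chosen so that the two configurations land at roughly the same parameter count, and biases, embeddings, and layer norms are ignored.

```python
def transformer_params(num_layers: int, d_model: int, d_ff_mult: int = 4) -> int:
    """Approximate encoder parameter count: four d x d attention projections
    plus a feed-forward block (d x 4d and 4d x d) per layer. Biases,
    embeddings, and layer norms are omitted for this rough comparison."""
    per_layer = 4 * d_model * d_model + 2 * d_model * (d_ff_mult * d_model)
    return num_layers * per_layer

multi_head = {"layers": 24, "heads": 16, "d_model": 1024}   # BERT-large shape
single_head = {"layers": 384, "heads": 1, "d_model": 256}   # assumed width

for name, cfg in [("multi-head", multi_head), ("single-head", single_head)]:
    total_heads = cfg["layers"] * cfg["heads"]
    params = transformer_params(cfg["layers"], cfg["d_model"])
    print(f"{name}: {total_heads} total heads/subspaces, ~{params / 1e6:.0f}M params")
```

Under these assumptions, both configurations give 384 total attention heads (24 x 16 versus 384 x 1) and roughly 302M parameters, which is the sense in which the abstract calls their sizes and subspace counts comparable.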
Paper Type: short