Multi-head or Single-head? An Empirical Comparison for Transformer Training

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission · Readers: Everyone
Abstract: Multi-head attention plays a crucial role in the recent success of the Transformer, leading to consistent performance improvements over conventional attention in a wide range of applications. The popular belief is that its effectiveness stems from the ability to jointly attend to multiple positions. In this paper, we first demonstrate that jointly attending to multiple positions is not a unique feature of multi-head attention, as multi-layer single-head attention also attends to multiple positions. We then suggest that the main advantage of multi-head attention is training stability, since it requires fewer layers than single-head attention to attend to the same number of positions. At the same time, we show that, with recent advances in deep learning, the training of deep single-head Transformers can be successfully stabilized. Once training difficulty is no longer a bottleneck, substantially deeper single-head Transformers achieve consistent performance improvements.
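
The contrast drawn in the abstract, one wide multi-head attention layer versus a deeper stack of single-head attention layers, can be sketched in a few lines of PyTorch. The snippet below is an illustrative sketch only, not the authors' implementation; the width, head count, depth, and residual wiring are assumptions chosen for demonstration.

```python
# Illustrative sketch (not the paper's code): one multi-head attention layer
# vs. a deeper stack of single-head attention layers of the same width.
import torch
import torch.nn as nn

d_model, num_heads, depth = 512, 8, 8      # assumed sizes for illustration
x = torch.randn(16, 32, d_model)           # (seq_len, batch, d_model)

# (a) A single multi-head attention layer: 8 heads jointly attend to
#     multiple positions in parallel within one layer.
multi_head = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads)
out_mh, _ = multi_head(x, x, x)

# (b) A stack of single-head attention layers: each layer has one head,
#     but depth lets the stack attend to multiple positions overall.
single_head_stack = nn.ModuleList(
    [nn.MultiheadAttention(embed_dim=d_model, num_heads=1) for _ in range(depth)]
)
out_sh = x
for layer in single_head_stack:
    attn_out, _ = layer(out_sh, out_sh, out_sh)
    out_sh = out_sh + attn_out             # residual connection, as in Transformers

print(out_mh.shape, out_sh.shape)          # both torch.Size([16, 32, 512])
```

Design (a) reaches multiple positions with a shallow network, which is easier to train; design (b) needs roughly `num_heads` times the depth to cover a comparable set of positions, which is where the training-stability question raised in the abstract comes in.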