Abstract: Multi-headed attention (MHA) is crucial to many modern NLP models. In "Are Sixteen Heads Really Better than One?", Michel et al. (2019) aim to improve our understanding of when and where MHA matters through a series of experiments that prune attention heads. In this paper, we reproduce the authors' experiments. Our results broadly support their conclusions: many attention heads can be ablated without a noticeable impact on performance, the encoder-decoder attention mechanism benefits the most from MHA, and the important heads are determined early in training.
Track: Ablation
NeurIPS Paper Id: /forum?id=ByxXhSBgIS
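To make the ablation setup concrete, below is a minimal sketch (not the authors' code) of head ablation via binary gates, in the spirit of Michel et al. (2019): each head's output is multiplied by a gate in {0, 1} before the output projection, so setting a gate to zero removes that head. The class name, dimensions, and mask handling here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskableMultiHeadAttention(nn.Module):
    """Self-attention layer whose individual heads can be ablated with a mask."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # One gate per head; setting an entry to 0 ablates that head.
        self.register_buffer("head_mask", torch.ones(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape

        def split(z: torch.Tensor) -> torch.Tensor:
            # (b, t, d_model) -> (b, n_heads, t, d_head)
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                                   # (b, n_heads, t, d_head)
        heads = heads * self.head_mask.view(1, -1, 1, 1)   # zero out ablated heads
        return self.out_proj(heads.transpose(1, 2).reshape(b, t, -1))

# Example: ablate head 3 of a 16-head layer and run a forward pass.
mha = MaskableMultiHeadAttention(d_model=512, n_heads=16)
mha.head_mask[3] = 0.0
out = mha(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```

Evaluating the model with one gate zeroed at a time gives the per-head importance scores used in the pruning experiments.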