Abstract: Multi-headed attention (MHA) is crucial to many modern NLP models. In "Are Sixteen Heads Really Better than One?", Michel et al. (2019) aim to improve our understanding of when and where MHA matters through a series of experiments that prune attention heads. In this paper, we reproduce the authors' experiments. Our results broadly support their conclusions: many attention heads can be ablated without a noticeable impact on performance, the encoder-decoder attention mechanism benefits the most from MHA, and the important heads are determined early in training.
Track: Ablation
NeurIPS Paper Id: /forum?id=ByxXhSBgIS
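To make the ablation setup concrete, below is a minimal sketch (not the authors' code) of head ablation via binary gates, in the spirit of Michel et al. (2019): each head's output is multiplied by a gate in {0, 1} before the output projection, so setting a gate to zero removes that head. The class name, dimensions, and mask handling here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskableMultiHeadAttention(nn.Module):
    """Self-attention layer whose individual heads can be ablated with a mask."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # One gate per head; setting an entry to 0 ablates that head.
        self.register_buffer("head_mask", torch.ones(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape

        def split(z: torch.Tensor) -> torch.Tensor:
            # (b, t, d_model) -> (b, n_heads, t, d_head)
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                                   # (b, n_heads, t, d_head)
        heads = heads * self.head_mask.view(1, -1, 1, 1)   # zero out ablated heads
        return self.out_proj(heads.transpose(1, 2).reshape(b, t, -1))

# Example: ablate head 3 of a 16-head layer and run a forward pass.
mha = MaskableMultiHeadAttention(d_model=512, n_heads=16)
mha.head_mask[3] = 0.0
out = mha(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```

Evaluating the model with one gate zeroed at a time gives the per-head importance scores used in the pruning experiments.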