Are Sixteen Heads Really Better than One?

Paul Michel, Omer Levy, Graham Neubig

06 Sept 2019 (modified: 05 May 2023), NeurIPS 2019
Abstract: Multi-headed attention is a driving force behind recent state-of-the-art NLP models. By applying multiple attention mechanisms in parallel, it can express sophisticated functions beyond a simple weighted average. However, we observe that, in practice, a large proportion of attention heads can be removed at test time without significantly impacting performance, and that some layers can even be reduced to a single head. Further analysis of machine translation models reveals that the self-attention layers can be significantly pruned, while the encoder-decoder attention layers are more dependent on multi-headedness.
Code Link: https://github.com/pmichel31415/are-16-heads-really-better-than-1
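The test-time pruning described in the abstract can be sketched as gating each head's output before the final projection: setting a head's gate to zero removes its contribution while leaving the output shape unchanged. Below is a minimal numpy sketch of this idea; the function and weight names are illustrative, not the authors' implementation (see the linked repository for that).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads, head_mask=None):
    """Self-attention over x (T, d) with per-head gates (1 = keep, 0 = prune).

    Hypothetical sketch: weights and head_mask are illustrative names.
    """
    T, d = x.shape
    dh = d // n_heads
    if head_mask is None:
        head_mask = np.ones(n_heads)
    # Project and split into heads: (H, T, dh)
    q = (x @ Wq).reshape(T, n_heads, dh).transpose(1, 0, 2)
    k = (x @ Wk).reshape(T, n_heads, dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(T, n_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention per head: (H, T, T)
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    heads = att @ v                              # (H, T, dh)
    heads = heads * head_mask[:, None, None]     # zero out pruned heads
    # Concatenate heads and apply the output projection
    return heads.transpose(1, 0, 2).reshape(T, d) @ Wo

rng = np.random.default_rng(0)
d, H, T = 16, 4, 5
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
x = rng.standard_normal((T, d))
full = multi_head_attention(x, Wq, Wk, Wv, Wo, H)
pruned = multi_head_attention(x, Wq, Wk, Wv, Wo, H,
                              head_mask=np.array([1, 0, 1, 1]))
# Output shape is unchanged; only head 1's contribution is removed.
print(full.shape == pruned.shape)
```

Because each head contributes additively before the output projection, pruning is a simple multiplicative mask, which is what makes ablating heads at test time (without retraining) straightforward.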