Multi-Head Attention: Collaborate Instead of Concatenate

28 Sept 2020 (modified: 22 Oct 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: attention, self-attention, bert, multi-head, tensor factorization
Abstract: Attention layers are widely used in natural language processing (NLP) and are beginning to influence computer vision architectures. However, they suffer from over-parameterization. For instance, it has been shown that the majority of attention heads can be pruned without impacting accuracy. This work aims to improve the current understanding of how multiple heads interact. Motivated by the observation that trained attention heads share common key/query projections, we propose a collaborative multi-head attention layer that enables heads to learn shared projections. Our scheme decreases the number of parameters in an attention layer and can be used as a drop-in replacement in any transformer architecture. For instance, by allowing heads to collaborate on a neural machine translation task, we can reduce the key dimension by 4× without any loss in performance. We also show that it is possible to re-parametrize a pre-trained multi-head attention layer into our collaborative attention layer. Even without retraining, collaborative multi-head attention manages to reduce the size of the key and query projections by half without sacrificing accuracy. Our code is public.
One-sentence Summary: Multi-head attention learns similar key/query projections, which can be shared across heads to decrease the number of parameters.
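As a rough illustration of the mechanism described in the abstract, the sketch below implements a collaborative self-attention layer in PyTorch: all heads share a single key/query projection of width `shared_dim` and differ only through learned per-head mixing vectors. The class name, parameter names, and shapes are our own assumptions for illustration, not the authors' released API.

```python
# Minimal sketch of collaborative multi-head attention (illustrative, not the
# authors' released implementation): heads share one key/query projection and
# are distinguished only by per-head mixing vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CollaborativeSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads, shared_dim):
        super().__init__()
        self.num_heads = num_heads
        self.shared_dim = shared_dim
        # Single query/key projections shared by all heads (d_model -> shared_dim),
        # replacing the per-head W_Q^i, W_K^i of standard multi-head attention.
        self.w_q = nn.Linear(d_model, shared_dim, bias=False)
        self.w_k = nn.Linear(d_model, shared_dim, bias=False)
        # Per-head mixing vectors m_i: head i scores Q W_Q diag(m_i) W_K^T K^T.
        self.mixing = nn.Parameter(torch.ones(num_heads, shared_dim))
        # Value and output projections stay as in standard multi-head attention.
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.w_q(x)                                   # (B, T, shared_dim)
        k = self.w_k(x)                                   # (B, T, shared_dim)
        v = self.w_v(x).view(B, T, self.num_heads, -1).transpose(1, 2)  # (B, H, T, d_head)
        # Re-weight the shared query features per head, then compute attention scores.
        q_h = q.unsqueeze(1) * self.mixing.view(1, self.num_heads, 1, self.shared_dim)
        scores = q_h @ k.unsqueeze(1).transpose(-2, -1) / self.shared_dim ** 0.5
        attn = F.softmax(scores, dim=-1)                  # (B, H, T, T)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.w_o(out)
```

With this parameterization, shrinking `shared_dim` (e.g. to a quarter of the usual total key width) reduces the key/query parameters of the layer, which is the trade-off the abstract reports for the machine translation experiment.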
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Community Implementations: [1 code implementation (CatalyzeX)](https://www.catalyzex.com/paper/arxiv:2006.16362/code)
Reviewed Version (pdf): https://openreview.net/references/pdf?id=jTjPeZE_m
14 Replies
