Keywords: Learning theory, Expressive capacity, Expressive power, Transformer, Attention
TL;DR: We prove that attention layers with a small query/key dimension can have limited expressive power. This weakness cannot be overcome without using an exponentially large number of attention heads.
Abstract: Attention-based mechanisms are widely used in machine learning, most prominently in transformers. However, hyperparameters such as the number of attention heads and the attention rank (i.e., the query/key dimension) are set nearly the same way in all realizations of this architecture, without theoretical justification. In this paper, we prove that the rank can have a dramatic effect on the representational capacity of attention. This effect persists even when the number of heads and the parameter count are very large. Specifically, we present a simple and natural target function based on nearest neighbor search that can be represented using a single full-rank attention head for any sequence length, but that cannot be approximated by a low-rank attention layer even on short sequences unless the number of heads is exponential in the embedding dimension. Thus, for this target function, rank is what determines an attention layer's power. We show that, for short sequences, using multiple layers allows the target to be approximated by low-rank attention; for long sequences, we conjecture that full-rank attention is necessary regardless of depth. Finally, we present experiments with standard multilayer transformers that validate our theoretical findings. They demonstrate that, because of how standard transformer implementations set the rank, increasing the number of attention heads can severely decrease accuracy on certain tasks.
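The abstract's central objects (the attention rank as the query/key dimension, and a nearest-neighbor target realized by a single full-rank head) can be stated concretely. Below is a minimal PyTorch sketch under stated assumptions; the specific parameterization (identity query/key maps, unit-norm tokens, excluding self-matches, the temperature value) is illustrative and not taken from the paper.

```python
# Minimal, illustrative sketch (not the paper's code): the "rank" of an
# attention head is the query/key dimension r, and a single full-rank head
# (r = d) with identity query/key maps and a large temperature approximates
# the nearest-neighbor target on unit-norm tokens. All names, shapes, and the
# unit-norm / no-self-match conventions are assumptions for illustration.
import torch

def attention_head(X, W_Q, W_K, W_V, temperature=1.0):
    """One attention head over a sequence X of shape (n, d).
    W_Q and W_K have shape (d, r), so the score matrix Q @ K.T has rank <= r."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = temperature * (Q @ K.T)              # (n, n), rank at most r
    scores.fill_diagonal_(float("-inf"))          # exclude trivial self-matches
    return torch.softmax(scores, dim=-1) @ V      # (n, d)

d, n = 16, 8
X = torch.nn.functional.normalize(torch.randn(n, d), dim=-1)   # unit-norm tokens

# Full-rank head: W_Q = W_K = identity with a large temperature, so the softmax
# concentrates on the token with the largest inner product, i.e. each token's
# nearest neighbor on the sphere -- the target function described above.
I_d = torch.eye(d)
out_full = attention_head(X, I_d, I_d, I_d, temperature=100.0)
nn_idx = (X @ X.T - 2 * torch.eye(n)).argmax(dim=-1)            # true nearest neighbors
print((out_full - X[nn_idx]).abs().max())                       # typically close to 0

# Low-rank head: r << d constrains the score matrix to rank r; this is the
# regime in which the paper argues a single layer needs exponentially many heads.
r = 2
W = torch.randn(d, r)
out_low = attention_head(X, W, W, I_d, temperature=100.0)
```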
Supplementary Material: zip
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5065