Consensus Is All You Get: The Role of Attention in Transformers

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC-ND 4.0
TL;DR: We provide a rigorous mathematical analysis of the asymptotic properties of attention in transformers, together with empirical results, showing that under certain assumptions all tokens asymptotically converge to a single cluster.
Abstract: A key component of transformers is the attention mechanism, which orchestrates how each token influences the propagation of every other token along the layers of a transformer. In this paper we provide a rigorous mathematical analysis of the asymptotic properties of attention in transformers. Although we present several results based on different assumptions, all of them point to the same conclusion: all tokens asymptotically converge to each other, a phenomenon that has been reported empirically in the literature. Our findings are carefully compared with existing theoretical results and illustrated by simulations and experimental studies using the GPT-2 model.
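For intuition, the following is a minimal, self-contained sketch of pure attention dynamics; the setup (token count, embedding dimension, fixed randomly drawn query/key matrices, identity value matrix, and renormalization to the unit sphere) is assumed for illustration and is not taken from the paper. Tokens are repeatedly passed through a softmax-attention averaging step, and their maximum pairwise distance shrinks toward zero, illustrating the kind of consensus the analysis predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, layers = 8, 16, 200          # tokens, embedding dim, depth (hypothetical sizes)

# Hypothetical fixed weights shared across layers; the paper's assumptions
# on the query/key/value matrices may differ -- this is only an illustrative sketch.
Q = rng.normal(size=(d, d)) / np.sqrt(d)
K = rng.normal(size=(d, d)) / np.sqrt(d)
V = np.eye(d)                      # identity value matrix keeps the dynamics simple

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # tokens on the unit sphere

def pairwise_spread(X):
    """Maximum pairwise distance between token embeddings."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diffs, axis=-1).max()

for t in range(layers):
    scores = (X @ Q.T) @ (X @ K.T).T / np.sqrt(d)       # attention logits
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                   # row-stochastic attention matrix
    X = A @ (X @ V.T)                                   # attention (averaging) update
    X /= np.linalg.norm(X, axis=1, keepdims=True)       # crude surrogate for layer normalization
    if t % 50 == 0:
        print(f"layer {t:4d}  spread = {pairwise_spread(X):.4f}")

print(f"final      spread = {pairwise_spread(X):.4f}")  # spread collapses toward zero
```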
Lay Summary: The self-attention mechanism is the key component of transformers, the main architecture behind modern Large Language Models (LLMs). What role does self-attention play in transformers? The previous literature suggests that self-attention forces tokens (words in a sentence or characters in a word) to collapse, that is, all tokens become the same as the depth of a transformer increases. In this work, we mathematically prove that tokens collapse. Our results are shown empirically to hold, even when our modeling assumptions are not satisfied, using the GPT-2 and GPT-Neo models. Having established that tokens collapse, our paper suggests the need to study the optimal transformer depth: transformers that are too shallow do not adequately use the context provided by a prompt, whereas transformers that are too deep ignore the prompt since all tokens become the same.
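The sketch below (not the authors' released code; see the repository linked below for that) shows one way such a measurement could be set up with Hugging Face's GPT-2: compute the mean pairwise cosine similarity of token hidden states at every layer, which token collapse predicts should grow with depth. The choice of model checkpoint, prompt, and similarity metric are assumptions made here for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative sketch only; the paper's experimental protocol may differ.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states: tuple of (num_layers + 1) tensors of shape (1, seq_len, hidden_dim)
for layer, h in enumerate(out.hidden_states):
    h = h[0]                                         # (seq_len, hidden_dim)
    h = torch.nn.functional.normalize(h, dim=-1)     # unit-norm token embeddings
    sim = h @ h.T                                    # pairwise cosine similarities
    n = sim.shape[0]
    off_diag = (sim.sum() - n) / (n * (n - 1))       # mean off-diagonal similarity
    print(f"layer {layer:2d}  mean pairwise cosine similarity = {off_diag.item():.3f}")
```

If the collapse phenomenon holds, the printed similarity should increase toward 1 in the deeper layers.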
Link To Code: https://github.com/cyphylab/GPT-Consensus/tree/main
Primary Area: Theory->Deep Learning
Keywords: Attention, Transformers, Consensus
Submission Number: 5037