Abstract: Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. Alternatives to softmax-based attention are being explored, as softmax attention tends to hinder effective information flow. Even *at initialisation*, it remains poorly understood why the propagation of signals and gradients through these random networks can be pathological, resulting in issues known as (i) vanishing/exploding gradients and (ii) rank collapse *in depth*, i.e. when all tokens converge to a single representation across layers. While rank collapse in depth naturally arises from repeated matrix multiplications---a common pattern across various architectures---we identify an additional and previously unknown challenge unique to softmax attention layers: (iii) rank collapse *in width*, which occurs as the context length increases. Using Random Matrix Theory, we conduct a rigorous analysis that uncovers a spectral gap between the two largest singular values of the attention matrix as the cause of (iii), which in turn exacerbates (i) and (ii).
Building on this insight, we propose a novel yet simple practical solution to mitigate rank collapse in width by removing the outlier eigenvalue(s). Our theoretical framework offers a fresh perspective on recent practical studies such as Ye et al. (2024) and Ali et al. (2023), whose ad hoc solutions can now be interpreted as implicit efforts to address the spectral gap issue. This work provides valuable theoretical support for ongoing large-scale empirical research, bringing theory and practice one step closer in the understanding of transformers.
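To make the spectral-gap claim concrete, the following is a minimal numerical sketch (not taken from the paper or its repository): it assumes Gaussian queries and keys at initialisation, and it uses a simple rank-one centering of the attention matrix as one plausible reading of "removing the outlier eigenvalue(s)", not necessarily the paper's exact construction.

```python
# Minimal sketch of the spectral gap in a random softmax attention matrix.
# Assumptions: Gaussian Q, K at initialisation; rank-one centering as one
# possible interpretation of "removing the outlier", hypothetical here.
import numpy as np

rng = np.random.default_rng(0)

def softmax_attention(T, d, rng):
    """Random softmax attention matrix for context length T and head dim d."""
    Q = rng.standard_normal((T, d))
    K = rng.standard_normal((T, d))
    logits = Q @ K.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)       # rows sum to 1 (row-stochastic)

for T in (64, 256, 1024):
    A = softmax_attention(T, d=64, rng=rng)
    s = np.linalg.svd(A, compute_uv=False)
    # Rank-one "outlier removal": subtract the uniform component (1/T) * 11^T,
    # which the row-stochastic structure of softmax makes dominant.
    A_centered = A - np.ones((T, T)) / T
    s_c = np.linalg.svd(A_centered, compute_uv=False)
    print(f"T={T:5d}  sigma1/sigma2 = {s[0] / s[1]:7.2f}   "
          f"after centering: {s_c[0] / s_c[1]:6.2f}")
```

Under these assumptions, the ratio between the two largest singular values of the raw attention matrix grows with the context length T, while centering largely closes the gap; this is only an illustrative toy experiment, not a reproduction of the paper's results.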
Lay Summary: Transformers are powerful AI models behind tools like language translators and chatbots. A key part of how they work is called attention, which helps the model decide what information to focus on. However, this attention mechanism can sometimes fail, especially when the model is dealing with long inputs or just starting to learn. When that happens, the model may treat different inputs as if they were the same, making it hard to learn or generate useful outputs.
Our research identifies a new reason for this problem: a hidden imbalance in the attention mechanism that gets worse as the input length increases. Using tools from mathematics, we show that this imbalance causes the model to lose important distinctions between data points—a problem we call rank collapse in width. We also propose a simple fix that helps the model stay stable and learn more effectively.
This work sheds light on why some recent practical improvements to attention have worked and offers a clearer path forward for building better, more reliable AI systems.
Link To Code: https://github.com/thizirinait/Mind-the-Gap-a-Spectral-Analysis-of-Rank-Collapse-and-Signal-Propagation-in-Attention-Layers
Primary Area: Theory->Deep Learning
Keywords: transformers, attention mechanism, spectral analysis, initialisation, random matrix theory, signal propagation, softmax
Submission Number: 6904