Abstract: Attention-based language models typically rely on the softmax function to convert attention logits into probability distributions. However, this mechanism can lead to $\textit{attention entropy collapse}$, where attention is focused on a single token, causing training instability. In this work, we identify the high $\textit{variance sensitivity}$ of softmax as a primary cause of this collapse. We show that $\textit{entropy-stable}$ attention mechanisms, which either control or are insensitive to the variance of attention logits, can prevent entropy collapse and enable more stable training. We provide empirical evidence of this effect in both large language models (LLMs) and a small Transformer model composed solely of self-attention, and support our findings with theoretical analysis. Moreover, we identify that the concentration of attention probabilities increases the norm of the attention probability matrix, leading to exploding gradients.
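The abstract's central claim, that softmax attention entropy is highly sensitive to the variance of the attention logits, can be illustrated numerically. The following is a minimal sketch, not code from the paper: it assumes a single query attending over 64 keys with Gaussian logits, and shows that as the logit scale (and hence variance) grows, the softmax distribution concentrates on one token and its entropy collapses toward zero.

```python
# Minimal sketch (illustrative only): softmax entropy vs. logit variance.
# Assumption: one query over 64 keys with standard-normal logits; the "scale"
# factor stands in for the logit standard deviation.
import numpy as np

def softmax(x):
    z = x - x.max()            # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p, eps=1e-12):
    return float(-(p * np.log(p + eps)).sum())

rng = np.random.default_rng(0)
logits = rng.standard_normal(64)

for scale in [0.5, 1.0, 2.0, 4.0, 8.0]:   # larger scale = larger logit variance
    p = softmax(scale * logits)
    print(f"logit std {scale:4.1f} -> attention entropy {entropy(p):.3f}")
```

Running this prints a monotonically shrinking entropy as the scale increases, mirroring the collapse behavior the paper attributes to softmax's variance sensitivity.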
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Transformer, kernel attention, softmax, entropy collapse, LLMs
Contribution Types: Model analysis & interpretability, Reproduction study
Languages Studied: English
Keywords: Attention, Entropy collapse, Training Instability, Transformer
Submission Number: 5164