Abstract: Attention-based language models typically rely on the softmax function to convert attention logits into probability vectors. However, this step can lead to \emph{attention entropy collapse}, in which attention concentrates on a single token and training becomes unstable. In this work, we identify the high \emph{variance-entropy sensitivity} of softmax as a root cause of this phenomenon, reproducing it with large language models (LLMs) and a simple Transformer model, and we show that \emph{Lipschitz-kernel}-based attention is robust against attention entropy collapse. Through both controlled and real training settings, we demonstrate that Lipschitz-kernel-based and softmax-based attention differ in their sensitivity to \emph{attention logit variance}, and that the high sensitivity of softmax-based attention to this variance drives attention entropy collapse. Moreover, we argue that attention entropy collapse leads to training instability because, as the attention probabilities become more concentrated, the norm of the attention probability matrix grows, ultimately causing gradient explosion.
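To make the claimed variance-entropy sensitivity of softmax concrete, the following minimal NumPy sketch (not the paper's code; the sequence length, logits, and scale values are illustrative assumptions) shows how the entropy of a softmax attention distribution falls toward zero as the standard deviation of the attention logits grows, i.e., the distribution collapses onto a single token:

```python
# Minimal sketch: entropy of softmax attention vs. logit variance.
# Assumption: logits for one query are drawn from a standard normal and
# rescaled to emulate growing logit variance during training.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = x - x.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps))

seq_len = 64
base_logits = rng.standard_normal(seq_len)   # hypothetical attention logits

for scale in [0.5, 1.0, 2.0, 4.0, 8.0]:      # scaling the logits scales their std
    p = softmax(scale * base_logits)
    print(f"logit std ~ {scale:>4.1f}  ->  attention entropy = {entropy(p):.3f}")
```

Running the sketch prints a monotonically decreasing entropy as the logit scale increases, which is the variance-sensitivity behavior the abstract attributes to softmax-based attention.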
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Transformer, kernel attention, softmax, entropy collapse, LLMs
Contribution Types: Model analysis & interpretability, Reproduction study, Approaches to low-resource settings
Languages Studied: English
Submission Number: 1340