Abstract: Attention-based language models typically rely on the softmax function to convert attention logits into probability vectors. However, this step can lead to \emph{attention entropy collapse}, in which attention concentrates on a single token and training becomes unstable. In this work, we identify the high \emph{variance-entropy sensitivity} of softmax as a root cause of this phenomenon, reproducing it with large language models (LLMs) and a simple Transformer model, and we show that \emph{Lipschitz-kernel}-based attention is robust against attention entropy collapse. Through both controlled and real training settings, we demonstrate that Lipschitz-kernel-based and softmax-based attention differ in their sensitivity to \emph{attention logit variance}, and that the high sensitivity of softmax-based attention to this variance drives attention entropy collapse. Moreover, we argue that attention entropy collapse leads to training instability because, as the attention probabilities become more concentrated, the norm of the attention probability matrix grows, ultimately causing gradient explosion.
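To make the claimed variance-entropy sensitivity of softmax concrete, the following minimal NumPy sketch (not the paper's code; the sequence length, logits, and scale values are illustrative assumptions) shows how the entropy of a softmax attention distribution falls toward zero as the standard deviation of the attention logits grows, i.e., the distribution collapses onto a single token:

```python
# Minimal sketch: entropy of softmax attention vs. logit variance.
# Assumption: logits for one query are drawn from a standard normal and
# rescaled to emulate growing logit variance during training.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = x - x.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps))

seq_len = 64
base_logits = rng.standard_normal(seq_len)   # hypothetical attention logits

for scale in [0.5, 1.0, 2.0, 4.0, 8.0]:      # scaling the logits scales their std
    p = softmax(scale * base_logits)
    print(f"logit std ~ {scale:>4.1f}  ->  attention entropy = {entropy(p):.3f}")
```

Running the sketch prints a monotonically decreasing entropy as the logit scale increases, which is the variance-sensitivity behavior the abstract attributes to softmax-based attention.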
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Transformer, kernel attention, softmax, entropy collapse, LLMs
Contribution Types: Model analysis & interpretability, Reproduction study, Approaches to low-resource settings
Languages Studied: English
Submission Number: 1340