Keywords: Attention, gradient descent, softmax, optimization
Abstract: Transformers have become ubiquitous in modern machine learning applications, yet their training remains a challenging task, often requiring extensive trial and error. Unlike previous architectures, transformers possess unique attention-based components, which can complicate the training process. The standard optimization algorithm, Gradient Descent, consistently underperforms in this context, underscoring the need for a deeper understanding of these difficulties. To understand this phenomenon, we analyze a simplified yet representative softmax attention model. Our local analysis of the gradient dynamics reveals that the Jacobian of the softmax function itself acts as a preconditioner. We show that when sufficiently many attention coefficients are small across multiple training examples, the Jacobian of the softmax becomes ill-conditioned, severely degrading the local curvature of the loss and thereby slowing the convergence of Gradient Descent. Our experiments confirm these theoretical findings on the critical impact of the softmax on the dynamics of Gradient Descent.
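As a point of reference for the mechanism sketched in the abstract (a standard fact about the softmax, not a result quoted from the submission; the notation $z$, $s$, $n$ is ours): for $s = \operatorname{softmax}(z) \in \mathbb{R}^n$, the Jacobian is
$$\frac{\partial s}{\partial z} = \operatorname{diag}(s) - s s^{\top},$$
a positive semidefinite matrix that always annihilates the all-ones vector and whose eigenvalues interlace the entries of $s$. Hence, when most attention coefficients $s_i$ are close to zero, most of these eigenvalues are close to zero as well, which is consistent with the ill-conditioning described above.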
Primary Area: optimization
Submission Number: 21706