Keywords: Attention, Transformers, Optimization, Dynamics, Gradient Descent, Convergence
Abstract: Transformers have become ubiquitous in modern machine learning applications, yet their training remains a challenging task that often requires extensive trial and error. Unlike previous architectures, transformers possess unique attention-based components, which can complicate the training process. The standard optimization algorithm, Gradient Descent, consistently underperforms in this context, underscoring the need for a deeper understanding of these difficulties; existing theoretical frameworks fall short and fail to explain this phenomenon. To address this gap, we analyze a simplified Softmax attention model that captures some of the core challenges associated with training transformers. Through a local analysis of the gradient dynamics, we highlight the role of the Softmax function in shaping the local curvature of the loss and show how it can lead to ill-conditioning of these models, which in turn can severely hamper the convergence speed. Our experiments confirm these theoretical findings on the critical impact of Softmax on the dynamics of Gradient Descent.
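A minimal sketch (not the paper's code) of the kind of effect the abstract describes: for a tiny one-matrix attention layer with a least-squares loss, one can compare the condition number of the Hessian with and without the Softmax. The shapes, random data, regression target, and the expansion point W = 0 are all illustrative assumptions.

```python
# Illustrative only: probe local curvature of a toy softmax-attention model
# versus a linear (no-softmax) counterpart via the Hessian condition number.
import torch

torch.manual_seed(0)
n, d = 8, 4                      # sequence length, embedding dimension (assumed)
X = torch.randn(n, d)            # a single input sequence (synthetic)
Y = torch.randn(n, d)            # an arbitrary regression target (synthetic)

def loss_fn(W, use_softmax=True):
    """Least-squares loss of a one-parameter-matrix attention layer."""
    scores = X @ W @ X.T / d**0.5                       # attention scores
    A = torch.softmax(scores, dim=-1) if use_softmax else scores
    return ((A @ X - Y) ** 2).mean()

def condition_number(use_softmax):
    W = torch.zeros(d, d)                               # expand around W = 0 (assumed)
    H = torch.autograd.functional.hessian(
        lambda w: loss_fn(w, use_softmax), W
    ).reshape(d * d, d * d)
    eig = torch.linalg.eigvalsh(H).abs()
    nonzero = eig[eig > 1e-8]                           # ignore the null directions
    return (nonzero.max() / nonzero.min()).item()

print("condition number with softmax:   ", condition_number(True))
print("condition number without softmax:", condition_number(False))
```

This is only a proxy for the paper's analysis: it measures the local curvature of one toy instance rather than reproducing the theoretical results.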
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13398