Keywords: Transformer, Training Dynamics, Model Crash
TL;DR: We revisit the training dynamics of Transformer to tame its training process without using learning rate warmup.
Abstract: Scaling Transformer to a large scale without technical tricks such as learning rate warmup or a substantially lower learning rate is an extremely challenging task and is gaining increasing attention. In this paper, we provide a theoretical analysis of Transformer training and reveal a key problem behind the model crash phenomenon during training, i.e., the spectral energy concentration of $W_q^{\top} W_k$ (where $W_q$ and $W_k$ are the projection matrices for query and key in Transformer), which is the cause of a malignant entropy collapse. To remedy this problem, motivated by Weyl's inequality, we present a novel optimization strategy that makes the weight updates in successive steps smooth: if the ratio $\frac{\sigma_{1}(\nabla W_t)}{\sigma_{1}(W_{t-1})}$ exceeds a threshold, where $\nabla W_t$ is the update quantity at step $t$, we automatically bound the learning rate to a weighted multiple of $\frac{\sigma_{1}(W_{t-1})}{\sigma_{1}(\nabla W_t)}$. Our optimization strategy prevents the rapid concentration of spectral energy in only a few directions and thus avoids the malignant entropy collapse that triggers the model crash. We conduct extensive experiments with ViT, Swin-Transformer and GPT, showing that our optimization strategy can effectively and stably train these Transformer models without using learning rate warmup.
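The learning rate rule described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the parameters `tau` (threshold) and `scale` (the weight in the "weighted multiple"), and the choice to cap the result at the base learning rate are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def spectral_norm(matrix):
    """Largest singular value sigma_1 of a 2-D matrix."""
    return float(np.linalg.norm(matrix, ord=2))


def bounded_learning_rate(weight, update, base_lr, tau, scale):
    """Return the learning rate to use for one step (hypothetical helper).

    weight  : W_{t-1}, the current weight matrix
    update  : the update quantity at step t (written nabla W_t in the abstract)
    base_lr : the learning rate the schedule would otherwise use
    tau     : assumed threshold on sigma_1(update) / sigma_1(weight)
    scale   : assumed weight applied to sigma_1(weight) / sigma_1(update)
    """
    sigma_w = spectral_norm(weight)
    sigma_u = spectral_norm(update)
    if sigma_u == 0.0:
        # A zero update cannot change the spectrum; keep the base rate.
        return base_lr
    # Equivalent to sigma_u / sigma_w > tau, without dividing by sigma_w.
    if sigma_u > tau * sigma_w:
        # Assumption: the bound acts as a cap and never raises the rate.
        return min(base_lr, scale * sigma_w / sigma_u)
    return base_lr


def apply_step(weight, update, base_lr, tau, scale):
    """Plain descent step using the bounded learning rate."""
    lr = bounded_learning_rate(weight, update, base_lr, tau, scale)
    return weight - lr * update
```

Under these assumptions, whenever the bound is active the step satisfies $\sigma_1(W_t) \le \sigma_1(W_{t-1}) + \alpha_t \sigma_1(\nabla W_t) \le (1 + \text{scale})\,\sigma_1(W_{t-1})$ by Weyl's inequality for singular values, so the largest singular value can grow by at most a factor of `1 + scale` in a single step.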
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9501