Keywords: neural language model, transformer, llm
Abstract: Loss spikes often occur during pre-training of large language models.
These spikes degrade the performance of large language models and can ruin the pre-training run entirely.
Because pre-training requires a vast computational budget, such spikes should be avoided.
Based on the assumption that loss spikes are caused by sudden growth of the gradient norm, we analyze the spectral norms of the Jacobian matrices of the sub-layers to identify factors that keep the gradient norm small.
Our findings suggest that stabilizing the pre-training process requires two conditions: small sub-layers and a large shortcut.
We conduct various experiments to empirically verify our theoretical analyses.
Experimental results demonstrate that methods satisfying the conditions effectively prevent loss spikes during pre-training.
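As a schematic illustration of the kind of bound the abstract alludes to (not the paper's exact derivation), consider a single residual sub-layer $y = x + F(x)$; by the triangle inequality for the spectral norm,
\[
\frac{\partial y}{\partial x} = I + J_F(x),
\qquad
\left\lVert \frac{\partial y}{\partial x} \right\rVert_2
\;\le\; \underbrace{\lVert I \rVert_2}_{\text{shortcut}} \,+\, \underbrace{\lVert J_F(x) \rVert_2}_{\text{sub-layer}}
\;=\; 1 + \lVert J_F(x) \rVert_2 .
\]
Keeping $\lVert J_F \rVert_2$ small while the shortcut term dominates keeps each per-layer factor of the backpropagated gradient norm close to 1, which is the intuition behind the "small sub-layers, large shortcut" conditions.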
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9567