Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Signal Propagation, Language Model, Training Stability, Gradient Explosion, Moment Control, Rank Collapse
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: A complete theory of signal propagation in transformers that exactly predicts forward and backward rank and variance, which we use to train models hundreds of layers deep.
Abstract: In spite of their huge success, transformer models remain difficult to scale in depth. In this work, we provide formulae that govern the moments of the forward and backward signal through all transformer components, and develop a unified signal-propagation theory for transformers. Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores. We also propose DeepScaleLM, an initialization and scaling scheme that conserves unit output/gradient moments throughout the model, enabling the training of very deep models with $100$s of layers. We find that transformer models could be much deeper -- our deep models improve perplexity by $1.0$ points and downstream task performance by $2.2$ points over shallow models across multiple model sizes, without any extra parameters, and even outperform larger shallow models while using only half the number of parameters.
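The abstract describes DeepScaleLM only at a high level. As an illustrative aside, the sketch below shows what a depth-aware residual scaling and variance-preserving initialization of this general flavor can look like in PyTorch. The $1/\sqrt{2N}$ branch factor and the $1/\sqrt{d_\text{in}}$ weight standard deviation are assumptions drawn from the broader signal-propagation literature, not the paper's exact DeepScaleLM prescription.

```python
# Minimal sketch of a depth-aware residual scaling + variance-preserving init.
# The specific factors (1/sqrt(2N) branch scale, 1/sqrt(d_in) weight std) are
# illustrative assumptions, NOT the paper's DeepScaleLM formulae.
import math
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """One pre-LN feed-forward block whose residual branch is down-scaled so the
    variance of the residual stream stays near 1 when N such blocks are stacked."""

    def __init__(self, d_model: int, d_ff: int, num_layers: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        # Assumption: 2 residual branches per layer (attention + FFN), so scale
        # each branch by 1/sqrt(2N) to keep the summed residual stream at unit variance.
        self.beta = 1.0 / math.sqrt(2 * num_layers)
        for m in self.ff:
            if isinstance(m, nn.Linear):
                # Variance-preserving init: Var(Wx) ~= Var(x) for each linear map.
                nn.init.normal_(m.weight, std=1.0 / math.sqrt(m.in_features))
                nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.beta * self.ff(self.norm(x))
```

The same idea would apply to the attention branch; the key design choice is that both the initialization variance and the residual-branch scale depend on depth, so forward activations and backward gradients keep roughly unit moments regardless of how many layers are stacked.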
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5624