Keywords: Transformers, self-attention, optimization, stability, spectral normalization, self-supervised learning, vision, speech, language, contrastive learning
Abstract: Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the "attention entropy" of each attention head over the course of training, which serves as a proxy for the sharpness of the attention. We observe a common, non-monotonic evolution of attention entropy across different settings: the attention entropy first decreases quickly in the initial phase of training, then increases quickly, and finally enters a long stable phase. While the exact shape can be affected by hyperparameters such as warmup, initialization, and learning rate, we find a close correlation between the minima of attention entropy and the model's training stability. To this end, we propose a simple and efficient solution dubbed $\sigma$Reparam, in which we reparametrize all linear layers with Spectral Normalization and an additional learned scalar. We provide a lower bound on the attention entropy as a function of the spectral norms of the query and key projections, which suggests that small attention entropy can be obtained with large spectral norms. $\sigma$Reparam decouples the growth rate of a weight matrix's spectral norm from its dimensionality, which we verify empirically. We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, automatic speech recognition, and language modeling tasks. We show that $\sigma$Reparam provides great stability and robustness with respect to the choice of hyperparameters.
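To make the abstract's description more concrete, below is a minimal PyTorch sketch of the two ingredients it mentions: tracking per-head attention entropy as a sharpness proxy, and reparametrizing a linear layer with a spectral norm and a learned scalar. The exact form $\hat{W} = (\gamma / \sigma(W))\, W$ with a power-iteration estimate of $\sigma(W)$, and the names `SigmaReparamLinear` and `attention_entropy`, are assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def attention_entropy(attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Mean row-wise entropy of an attention matrix of shape (..., queries, keys).

    Low entropy means sharp (peaked) attention; this is the quantity tracked
    per head during training as a proxy for attention sharpness.
    """
    return -(attn * (attn + eps).log()).sum(dim=-1).mean()


class SigmaReparamLinear(nn.Module):
    """Hypothetical sketch of a sigma-reparametrized linear layer.

    The weight is used as W_hat = (gamma / sigma(W)) * W, where sigma(W) is the
    spectral norm (estimated by one power-iteration step per forward pass) and
    gamma is a learned scalar initialized to 1.
    """

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
        # Learned scalar; initializing to 1 starts training from the
        # spectrally normalized weight (assumed initialization).
        self.gamma = nn.Parameter(torch.ones(1))
        # Left singular vector estimate for power iteration.
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            # One power-iteration step to refine the leading singular vectors.
            v = F.normalize(self.weight.t() @ self.u, dim=0)
            self.u = F.normalize(self.weight @ v, dim=0)
        # Spectral norm estimate; gradients still flow through self.weight here.
        sigma = torch.dot(self.u, self.weight @ v)
        w_hat = (self.gamma / sigma) * self.weight
        return F.linear(x, w_hat, self.bias)
```

In this reading, the layer would act as a drop-in replacement for `nn.Linear` in the Transformer's projections (query, key, value, output, and MLP), so that the effective spectral norm of each weight is controlled by a single learned scalar rather than growing with the matrix's dimensionality.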
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Optimization (eg, convex and non-convex optimization)
TL;DR: We introduce a weight reparameterization method which stabilizes transformer training across a variety of domains and setups, enabling simpler training recipes and robustness to hyperparameters without performance tradeoffs.
Supplementary Material: zip