Gradient Heterogeneity Complements Hessian Heterogeneity in Transformer Optimization

TMLR Paper8817 Authors

08 May 2026 (modified: 14 May 2026) | Under review for TMLR | CC BY 4.0
Abstract: Transformers are difficult to optimize with stochastic gradient descent (SGD) and largely rely on adaptive optimizers such as Adam. Despite extensive efforts, a theoretical explanation for Adam's advantage over SGD in Transformer optimization is still incomplete. In this study, we analyze the optimization of Transformer models in the fine-tuning setting through the lens of gradient heterogeneity, defined as the variation in gradient norms across parameter blocks. We provide a theoretical analysis showing that gradient heterogeneity, together with Hessian heterogeneity, degrades the convergence of gradient-based methods such as SGD, while sign-based methods are substantially less sensitive to this effect. Adam's coordinate-wise normalization makes its update directions depend mainly on gradient signs, so Adam can be interpreted as a soft variant of SignSGD. Our analysis uses the fact that SGD and SignSGD follow steepest descent directions under different norms, and derives upper bounds on the iteration complexity with implications for learning rate scaling of SignSGD. We further investigate the origin of gradient heterogeneity in Transformer architectures and show that it is strongly influenced by the placement of layer normalization, with Post-LN architectures exhibiting particularly pronounced heterogeneity. Experimental results from fine-tuning Transformers in both NLP and vision domains validate our theoretical analysis.
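To make the abstract's two central ideas concrete, the following is a minimal sketch, not the authors' code. It computes per-block gradient norms on made-up gradients and compares how SGD, SignSGD, and the first bias-corrected Adam step scale their updates. The block names (`embedding`, `classifier_head`), the gradient values, and the max/min norm ratio used as a crude heterogeneity summary are all assumptions for illustration; the paper's formal definition may differ.

```python
import math

def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def block_grad_norms(grads):
    """L2 norm of the gradient within each parameter block."""
    return {name: math.sqrt(sum(g * g for g in block))
            for name, block in grads.items()}

def sgd_update(grads, lr):
    """SGD: step proportional to the raw gradient, so block norms carry over."""
    return {name: [-lr * g for g in block] for name, block in grads.items()}

def signsgd_update(grads, lr):
    """SignSGD: every nonzero coordinate moves by exactly lr."""
    return {name: [-lr * sign(g) for g in block] for name, block in grads.items()}

def adam_first_update(grads, lr, eps=1e-8):
    """First Adam step with bias correction: m_hat = g and v_hat = g**2,
    so the update is -lr * g / (|g| + eps), close to -lr * sign(g)."""
    return {name: [-lr * g / (abs(g) + eps) for g in block]
            for name, block in grads.items()}

# Hypothetical gradients for two parameter blocks (illustrative values only).
toy_grads = {
    "embedding": [1e-3, -2e-3],
    "classifier_head": [0.5, -1.0],
}
lr = 0.1

norms = block_grad_norms(toy_grads)
# Crude heterogeneity summary (an assumption, not the paper's definition).
ratio = max(norms.values()) / min(norms.values())

print("block gradient norms:", norms)
print("max/min norm ratio:", ratio)
print("SGD update norms:", block_grad_norms(sgd_update(toy_grads, lr)))
print("SignSGD update norms:", block_grad_norms(signsgd_update(toy_grads, lr)))
print("Adam first-step update norms:", block_grad_norms(adam_first_update(toy_grads, lr)))
```

In this toy setting the SGD update inherits the full disparity between block gradient norms, so a single learning rate cannot suit both blocks, while the SignSGD and first-step Adam updates have nearly identical norms across blocks. This mirrors, in a very simplified form, the abstract's reading of Adam as a soft variant of SignSGD.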
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Konstantin_Mishchenko1
Submission Number: 8817