Keywords: Transformer Hessians, Layer Normalization, Scaling laws, Convergence dynamics, Loss landscape, Optimization geometry
TL;DR: We provide the first complete Hessian analysis of Transformer blocks—including LayerNorm and feedforward layers—linking second-order structure to convergence dynamics and scaling laws.
Abstract: The lack of theoretical results for the Hessians of Layer Normalization and feedforward layers has left a gap in the study of Transformer optimization landscapes. We address this by deriving explicit second-order expressions for these components, thereby completing the Hessian characterization of full Transformer blocks. Our results generalize prior analyses of self-attention and yield estimates of the role each sublayer plays in curvature propagation. We demonstrate how these Hessian structures inform both convergence dynamics and the empirical scaling laws governing large-model performance. Further, we propose a framework based on second-order Taylor expansions of loss differences to quantify convergence trajectories. By extending Hessian theory to the full Transformer architecture, this work establishes a new foundation for theoretical and empirical investigations of optimization in large-scale deep learning.
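As an illustration of the quantities the abstract refers to (not material from the submission itself), the standard second-order expansion that a Taylor-expansion framework for loss differences would presumably build on is
$$ \mathcal{L}(\theta + \Delta\theta) - \mathcal{L}(\theta) \;\approx\; \nabla\mathcal{L}(\theta)^{\top}\Delta\theta \;+\; \tfrac{1}{2}\,\Delta\theta^{\top} H(\theta)\,\Delta\theta, \qquad H(\theta) = \nabla^{2}\mathcal{L}(\theta), $$
so explicit expressions for $H$ of the LayerNorm and feedforward sublayers directly constrain the curvature term. The sketch below uses JAX automatic differentiation to materialize the dense Hessian of a toy pre-LN feedforward sublayer; the block definition, dimensions, and squared-error loss are illustrative assumptions, not the authors' derivation or experimental setup.

```python
# Minimal sketch (illustrative only): dense Hessian of a toy LayerNorm + feedforward
# sublayer via JAX autodiff. Shapes, initialization, and the squared-error loss are
# assumptions for illustration, not taken from the submission.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def layer_norm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm over the last (feature) dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / jnp.sqrt(var + eps) + beta

def ffn_sublayer(params, x):
    # Pre-LN feedforward sublayer with a residual connection.
    h = layer_norm(x, params["gamma"], params["beta"])
    h = jax.nn.gelu(h @ params["W1"] + params["b1"])
    return x + h @ params["W2"] + params["b2"]

def loss(params, x, y):
    # Illustrative squared-error objective; any scalar loss works the same way.
    return jnp.mean((ffn_sublayer(params, x) - y) ** 2)

# Tiny dimensions so the full (P, P) Hessian is cheap to form explicitly.
d_model, d_ff, n_tokens = 4, 8, 3
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
params = {
    "gamma": jnp.ones(d_model), "beta": jnp.zeros(d_model),
    "W1": 0.1 * jax.random.normal(k1, (d_model, d_ff)), "b1": jnp.zeros(d_ff),
    "W2": 0.1 * jax.random.normal(k2, (d_ff, d_model)), "b2": jnp.zeros(d_model),
}
x = jax.random.normal(k3, (n_tokens, d_model))
y = jax.random.normal(k4, (n_tokens, d_model))

# Flatten the parameter pytree so the Hessian is a single dense matrix.
flat, unravel = ravel_pytree(params)
H = jax.hessian(lambda p: loss(unravel(p), x, y))(flat)

# Leading eigenvalues summarize the local curvature seen by an optimizer.
print(H.shape, jnp.linalg.eigvalsh(H)[-3:])
```

At this toy scale, the rows and columns of H corresponding to the LayerNorm parameters and to the two feedforward weight matrices form identifiable blocks, which is the kind of sublayer-wise curvature structure the abstract describes analytically.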
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 20952