Keywords: Transformer Hessians, Layer Normalization, Scaling laws, Convergence dynamics, Loss landscape, Optimization geometry
TL;DR: We provide the first complete Hessian analysis of Transformer blocks—including LayerNorm and feedforward layers—linking second-order structure to convergence dynamics and scaling laws.
Abstract: The lack of theoretical results for the Hessians of Layer Normalization and feedforward layers has left a gap in the study of Transformer optimization landscapes. We address this by deriving explicit second-order expressions for these components, thereby completing the Hessian characterization of full Transformer blocks. Our results generalize prior analyses of self-attention and yield estimates of the role each sublayer plays in curvature propagation. We demonstrate how these Hessian structures inform both convergence dynamics and the empirical scaling laws governing large-model performance. Further, we propose a framework based on second-order Taylor expansions of loss differences to quantify convergence trajectories. By extending Hessian theory to the full Transformer architecture, this work establishes a new foundation for theoretical and empirical investigations of optimization in large-scale deep learning.
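As an illustration of the quantities the abstract refers to (not material from the submission itself), the standard second-order expansion that a Taylor-expansion framework for loss differences would presumably build on is
$$ \mathcal{L}(\theta + \Delta\theta) - \mathcal{L}(\theta) \;\approx\; \nabla\mathcal{L}(\theta)^{\top}\Delta\theta \;+\; \tfrac{1}{2}\,\Delta\theta^{\top} H(\theta)\,\Delta\theta, \qquad H(\theta) = \nabla^{2}\mathcal{L}(\theta), $$
so explicit expressions for $H$ of the LayerNorm and feedforward sublayers directly constrain the curvature term. The sketch below uses JAX automatic differentiation to materialize the dense Hessian of a toy pre-LN feedforward sublayer; the block definition, dimensions, and squared-error loss are illustrative assumptions, not the authors' derivation or experimental setup.

```python
# Minimal sketch (illustrative only): dense Hessian of a toy LayerNorm + feedforward
# sublayer via JAX autodiff. Shapes, initialization, and the squared-error loss are
# assumptions for illustration, not taken from the submission.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def layer_norm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm over the last (feature) dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / jnp.sqrt(var + eps) + beta

def ffn_sublayer(params, x):
    # Pre-LN feedforward sublayer with a residual connection.
    h = layer_norm(x, params["gamma"], params["beta"])
    h = jax.nn.gelu(h @ params["W1"] + params["b1"])
    return x + h @ params["W2"] + params["b2"]

def loss(params, x, y):
    # Illustrative squared-error objective; any scalar loss works the same way.
    return jnp.mean((ffn_sublayer(params, x) - y) ** 2)

# Tiny dimensions so the full (P, P) Hessian is cheap to form explicitly.
d_model, d_ff, n_tokens = 4, 8, 3
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
params = {
    "gamma": jnp.ones(d_model), "beta": jnp.zeros(d_model),
    "W1": 0.1 * jax.random.normal(k1, (d_model, d_ff)), "b1": jnp.zeros(d_ff),
    "W2": 0.1 * jax.random.normal(k2, (d_ff, d_model)), "b2": jnp.zeros(d_model),
}
x = jax.random.normal(k3, (n_tokens, d_model))
y = jax.random.normal(k4, (n_tokens, d_model))

# Flatten the parameter pytree so the Hessian is a single dense matrix.
flat, unravel = ravel_pytree(params)
H = jax.hessian(lambda p: loss(unravel(p), x, y))(flat)

# Leading eigenvalues summarize the local curvature seen by an optimizer.
print(H.shape, jnp.linalg.eigvalsh(H)[-3:])
```

At this toy scale, the rows and columns of H corresponding to the LayerNorm parameters and to the two feedforward weight matrices form identifiable blocks, which is the kind of sublayer-wise curvature structure the abstract describes analytically.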
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 20952