Adam or Gauss-Newton? — A Comparative Study In Terms of Basis Alignment and SGD Noise

ICLR 2026 Conference Submission 15993 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Optimization, Adam, Gauss-Newton, Second-order Optimizers
TL;DR: We compare Adam- and Gauss-Newton-based diagonal optimizers on linear regression and various synthetic tasks, in both the identity and the Hessian basis, and provide theoretical results under certain assumptions.
Abstract: Approximate second-order optimizers show increasing promise for accelerating the training of deep learning models, yet their practical performance depends critically on how preconditioning is applied. Two predominant approaches to preconditioning are based on (1) Adam, which leverages statistics of the current gradient, and (2) Gauss-Newton (GN) methods, which use approximations to the Fisher information matrix (often raised to a power). This work compares these approaches through the lens of two key factors: the choice of basis in the preconditioner and the impact of gradient noise from mini-batching. To gain insight, we analyze these optimizers on quadratic objectives and logistic regression under all four combinations of these two factors. We show that, regardless of the basis, there exist instances where Adam outperforms both $\text{GN}^{-1}$ and $\text{GN}^{-1/2}$ in full-batch settings. Conversely, in the stochastic regime, Adam behaves similarly to $\text{GN}^{-1/2}$ under a Gaussian data assumption. These theoretical results are supported by empirical studies on both convex and non-convex objectives.
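To make the compared update rules concrete, the following minimal sketch (not the authors' code; the quadratic setup, step sizes, and hyperparameters are illustrative assumptions) contrasts an Adam-style diagonal preconditioner built from gradient statistics with diagonal $\text{GN}^{-1}$ and $\text{GN}^{-1/2}$ preconditioners on a full-batch linear-regression quadratic, where the Gauss-Newton matrix coincides with the Hessian.

```python
# Illustrative sketch: Adam-style vs. diagonal Gauss-Newton preconditioning
# on f(x) = 0.5 * x^T H x - b^T x with H = A^T A / n (linear regression).
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.normal(size=(n, d))
y = rng.normal(size=n)
H = A.T @ A / n                      # Gauss-Newton matrix equals the Hessian here
b = A.T @ y / n

def grad(x):
    return H @ x - b

def run(precondition, steps=500, lr=0.1, beta2=0.999, eps=1e-8):
    x = np.zeros(d)
    m = np.zeros(d)                  # second-moment accumulator (Adam-style only, no momentum)
    for t in range(1, steps + 1):
        g = grad(x)
        if precondition == "adam":          # preconditioner from current gradient statistics
            m = beta2 * m + (1 - beta2) * g**2
            x -= lr * g / (np.sqrt(m / (1 - beta2**t)) + eps)
        elif precondition == "gn_inv":      # GN^{-1}: divide by diag(H)
            x -= lr * g / np.diag(H)
        elif precondition == "gn_inv_sqrt": # GN^{-1/2}: divide by sqrt(diag(H))
            x -= lr * g / np.sqrt(np.diag(H))
    return 0.5 * x @ H @ x - b @ x          # final quadratic loss

for name in ("adam", "gn_inv", "gn_inv_sqrt"):
    print(name, run(name))
```

The sketch only illustrates how the three diagonal preconditioners differ mechanically; the paper's analysis additionally considers rotating into the Hessian eigenbasis and the effect of mini-batch gradient noise.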
Supplementary Material: zip
Primary Area: optimization
Submission Number: 15993