Almost Bayesian: Dynamics of SGD Through Singular Learning Theory

Published: 26 Jan 2026, Last Modified: 11 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: singular learning theory, SGD, gradient noise, gradient descent, Fokker-Planck, training dynamics, Bayes, Bayesian
TL;DR: We examine the long-run dynamics of SGD as diffusion on porous media, using tools from singular learning theory.
Abstract: The nature of the relationship between Bayesian sampling and stochastic gradient descent (SGD) in neural networks has been a long-standing open question in the theory of deep learning. We shed light on this question by modelling the long-run behaviour of SGD as diffusion on porous media. Using singular learning theory, we show that the late-stage dynamics are strongly shaped by the degeneracies of the loss surface. From this we show that, under reasonable hyperparameter choices for vanilla SGD, the local steady-state distribution of SGD (if it exists) is effectively a tempered version of the Bayesian posterior over the weights, one that accounts for local accessibility constraints.
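A minimal sketch of the correspondence the abstract describes, under the standard continuous-time (SDE) approximation of SGD with constant isotropic gradient noise; the covariance Σ = σ²I, batch size B, learning rate η, and flat prior are illustrative assumptions, not necessarily the paper's exact setting:

```latex
% Assumed: SDE approximation of SGD with constant isotropic
% per-sample gradient-noise covariance \Sigma = \sigma^2 I.
\[
  \mathrm{d}w_t = -\nabla L(w_t)\,\mathrm{d}t
    + \sqrt{\tfrac{\eta\,\sigma^2}{B}}\,\mathrm{d}B_t .
\]
% The stationary solution of the associated Fokker--Planck
% equation is then
\[
  p_\infty(w) \;\propto\; \exp\!\Big(-\tfrac{2B}{\eta\,\sigma^2}\,L(w)\Big),
\]
% which has the form of a tempered Bayesian posterior over n samples,
% p_\beta(w) \propto \varphi(w)\,e^{-n\beta L_n(w)} (flat prior \varphi),
% at effective inverse temperature \beta = 2B/(n\,\eta\,\sigma^2).
```

Under this standard approximation, the effective temperature is set jointly by the learning rate and batch size, which is one way to read the abstract's claim that "reasonable choices of hyperparameters" yield an effectively tempered posterior.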
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 8995