Keywords: SGD noise, loss landscape, Fokker–Planck equation
TL;DR: SGD isn't Brownian motion. We derive a Fokker–Planck equation showing that SGD dynamics are deterministic motion in a fluctuating loss landscape. This explains the inverse variance–flatness relation, which we validate empirically.
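For orientation, the conventional baseline the TL;DR argues against models SGD as a Langevin process; a minimal sketch of that standard textbook form is given below (this is the picture being challenged, not the paper's derived equation, and the constant isotropic diffusion coefficient D is a simplifying assumption):

```latex
% Standard Langevin model of SGD and its Fokker--Planck equation
% (the conventional Brownian-motion baseline; the paper's derived
% equation differs from this form).
\begin{align}
  d\theta_t &= -\nabla L(\theta_t)\, dt + \sqrt{2D}\, dW_t, \\
  \partial_t P(\theta, t) &= \nabla \cdot \bigl( P\, \nabla L \bigr) + D\, \nabla^2 P.
\end{align}
```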
Abstract: The conventional wisdom in deep learning theory often models Stochastic Gradient Descent (SGD) as a Brownian particle described by a Langevin equation. In this work, we challenge this paradigm and propose a more fundamental perspective: SGD is best understood as deterministic dynamics within a fluctuating loss landscape. From first principles, we derive a master equation for the parameter evolution and its corresponding Fokker–Planck equation, which differs from the standard form used for Brownian motion. We analyze the resulting dynamics near minima, where the loss is approximately quadratic, and identify distinct behavioral regimes. The most intriguing behavior emerges in the presence of valleys in the landscape. We show that in this regime the dynamics do not converge to a stationary distribution; instead, individual SGD trajectories diffuse along the floor of these valleys with an effective diffusion coefficient proportional to the learning rate. We validate these theoretical claims empirically on deep learning tasks in computer vision and natural language processing.
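As a quick plausibility check of the valley-diffusion claim, here is a minimal toy sketch (not the paper's code): SGD on a synthetic per-sample loss whose mean is a valley, curved in one direction and exactly flat along the other. The per-sample loss, all constants, and the continuous-time convention t = eta * steps are illustrative assumptions.

```python
# Minimal toy sketch (illustrative assumptions throughout, not the paper's code):
# per-sample loss l_i(theta) = 0.5*k*(theta1 - a_i)^2 + b_i*theta2 with
# E[a_i] = E[b_i] = 0, so the *mean* loss is a valley: curved in theta1 and
# exactly flat along theta2 (the valley floor). Minibatch sampling makes the
# landscape fluctuate, which drives diffusion along theta2.
import numpy as np

rng = np.random.default_rng(0)
n_samples, batch, k, n_steps, n_runs = 10_000, 32, 5.0, 2_000, 200
a = rng.normal(0.0, 0.3, n_samples)
b = rng.normal(0.0, 0.3, n_samples)

def valley_variance(eta):
    """Variance of theta2 across independent SGD runs after n_steps."""
    theta = np.zeros((n_runs, 2))
    for _ in range(n_steps):
        idx = rng.integers(0, n_samples, size=(n_runs, batch))
        g1 = k * (theta[:, 0] - a[idx].mean(axis=1))  # restoring gradient across the valley
        g2 = b[idx].mean(axis=1)                      # zero-mean minibatch gradient along the floor
        theta[:, 0] -= eta * g1
        theta[:, 1] -= eta * g2
    return theta[:, 1].var()

for eta in (0.01, 0.02, 0.04):
    var = valley_variance(eta)
    # With the continuous-time convention t = eta * n_steps, Var = 2 * D_eff * t,
    # so D_eff should scale linearly with the learning rate eta.
    print(f"eta={eta:.2f}  Var(theta2)={var:.2e}  D_eff={var / (2 * eta * n_steps):.2e}")
```

Under these toy assumptions, doubling eta should roughly double the printed D_eff, consistent with the abstract's claimed scaling of the effective diffusion coefficient with the learning rate.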
Primary Area: optimization
Submission Number: 12428