Keywords: stochastic gradient descent, linear stability, saddle points, deep learning theory
Abstract: Characterizing and understanding the dynamics of stochastic gradient descent (SGD) around saddle points remains an open problem in neural network optimization. We identify two distinct types of saddle points and demonstrate that Type-II saddles pose a significant challenge because the gradient noise vanishes there, making them particularly difficult for SGD to escape. We show that the dynamics around these saddles can be effectively modeled by a random matrix product process, which allows us to apply concepts from probabilistic stability and Lyapunov exponents. Leveraging ergodic theory, we establish that saddle points can be either attractive or repulsive for SGD, leading to a classification of four distinct dynamic phases determined by the gradient's signal-to-noise ratio near the saddle. We apply the theory to the initial stage of neural network training, explaining the intriguing phenomenon that networks are prone to becoming stuck near the initialization point at larger learning rates. Our results offer a novel theoretical framework for understanding the intricate behavior of SGD around saddle points, with implications for improving optimization strategies in deep learning.
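The stability criterion sketched in the abstract (attraction vs. repulsion decided by the sign of a Lyapunov exponent of a random matrix product) can be illustrated with a minimal toy computation. The sketch below assumes a one-dimensional linearization of SGD near a saddle, x_{t+1} = (1 - eta * h_t) x_t with noisy curvature samples h_t, and Monte Carlo estimation of the exponent; the function name `lyapunov_exponent` and the Gaussian noise model are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def lyapunov_exponent(eta, mu, sigma, n_steps=100_000, seed=0):
    """Monte Carlo estimate of the Lyapunov exponent for the 1-D
    linearized SGD update x_{t+1} = (1 - eta * h_t) * x_t, where
    h_t ~ N(mu, sigma^2) models the noisy curvature seen by SGD
    near a saddle. Illustrative model only.
    """
    rng = np.random.default_rng(seed)
    h = rng.normal(mu, sigma, size=n_steps)
    # lambda = E[log |1 - eta * h|]; lambda < 0 means the saddle attracts SGD
    # along this direction, lambda > 0 means SGD escapes.
    return np.mean(np.log(np.abs(1.0 - eta * h)))

# Example: sweep the learning rate at a fixed gradient signal-to-noise ratio.
for eta in (0.01, 0.1, 1.0, 5.0):
    lam = lyapunov_exponent(eta, mu=0.05, sigma=0.5)
    status = "attractive" if lam < 0 else "repulsive"
    print(f"eta={eta:5.2f}  Lyapunov exponent ~ {lam:+.4f}  ({status})")
```

Under this toy model, varying the learning rate and the ratio mu/sigma flips the sign of the exponent, mirroring the phase classification described in the abstract.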
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2372