Keywords: Stochastic Differential Equations, Stochastic Optimization, Adaptive Methods
TL;DR: We derive novel SDEs for SignSGD, RMSprop(W), and Adam(W), providing a more accurate theoretical understanding of their dynamics, convergence, and robustness. We validate our findings with experiments on various neural network architectures.
Abstract: Despite the vast empirical evidence supporting the efficacy of adaptive optimization methods in deep learning, their theoretical understanding is far from complete. In this work, we introduce novel SDEs for commonly used adaptive optimizers: SignSGD, RMSprop(W), and Adam(W). Our SDEs offer a quantitatively accurate description of these optimizers and help bring to light an intricate relationship between adaptivity, gradient noise, and curvature. Our novel analysis of SignSGD highlights a noteworthy and precise contrast to SGD in terms of convergence speed, stationary distribution, and robustness to heavy-tailed noise. We extend this analysis to AdamW and RMSpropW, for which we observe that the role of noise is much more complex. Crucially, we support our theoretical analysis with experimental evidence: we verify our insights by numerically integrating our SDEs with the Euler-Maruyama discretization on various neural network architectures such as MLPs, CNNs, ResNets, and Transformers. Our SDEs accurately track the behavior of the respective optimizers, especially when compared to previous SDEs derived for Adam and RMSprop. We believe our approach can provide valuable insights into best training practices and novel scaling rules.
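The abstract mentions numerically integrating the derived SDEs via Euler-Maruyama discretization. Below is a minimal, hedged sketch of that integration scheme for a generic SDE of the form dθ_t = b(θ_t) dt + σ(θ_t) dW_t; the drift `drift` and diffusion `diffusion` used in the example are hypothetical placeholders (a quadratic loss with constant noise), not the paper's actual SDE coefficients for SignSGD, RMSprop(W), or Adam(W).

```python
# Minimal Euler-Maruyama sketch for d(theta) = b(theta) dt + sigma(theta) dW.
# The drift/diffusion below are illustrative placeholders, not the paper's SDEs.
import numpy as np

def euler_maruyama(b, sigma, theta0, dt, n_steps, rng=None):
    """Integrate the SDE with step size dt and return the full trajectory."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float).copy()
    path = [theta.copy()]
    for _ in range(n_steps):
        dW = rng.normal(scale=np.sqrt(dt), size=theta.shape)  # Brownian increment
        theta = theta + b(theta) * dt + sigma(theta) * dW     # Euler-Maruyama step
        path.append(theta.copy())
    return np.stack(path)

# Hypothetical example: quadratic loss L(theta) = 0.5 * ||theta||^2, isotropic noise.
drift = lambda th: -th       # gradient-flow drift for the quadratic loss
diffusion = lambda th: 0.1   # constant (assumed) noise scale
trajectory = euler_maruyama(drift, diffusion, theta0=np.ones(4), dt=1e-2, n_steps=1000)
```

In practice, one would compare such a simulated trajectory against the iterates of the corresponding discrete optimizer to check how closely the SDE tracks it.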
Supplementary Material: zip
Primary Area: Optimization (convex and non-convex, discrete, stochastic, robust)
Submission Number: 16282