Exact risk curves of signSGD in high dimensions: quantifying preconditioning and noise-compression effects
TL;DR: We derive SDEs and ODEs for signSGD in high dimensions, identifying scheduling, preconditioning, and noise-compression effects.
Abstract: In recent years, signSGD has garnered interest both as a practical optimizer and as a simple model for understanding adaptive optimizers such as Adam. Although there is broad consensus that signSGD preconditions the optimization and reshapes the gradient noise, quantifying these effects in theoretically solvable settings remains difficult. We analyze signSGD in a high-dimensional limit and derive a limiting SDE and ODE that describe the risk. Using this framework, we quantify four effects of signSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Our analysis is consistent with experimental observations and goes further by quantifying how these effects depend on the data and noise distributions. We conclude with a conjecture on how these results might extend to Adam.
Lay Summary: Despite being one of the most widely used algorithms in modern machine learning, Adam remains poorly understood from a theoretical perspective. To shed light on its behavior, we study signSGD, a special case of Adam, with the goal of gaining deeper insight into the full algorithm. We model signSGD using a deterministic system of ordinary differential equations (ODEs) that accurately describes its dynamics in the high-dimensional limit, allowing us to replace an inherently stochastic process with a deterministic one. Our analysis uncovers precise preconditioning and noise-compression effects that have long been hypothesized for signSGD and, by extension, Adam.
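As a concrete illustration of the algorithm studied here, the sketch below runs signSGD on a streaming Gaussian least-squares problem and records the population risk curve. This is a minimal sketch, not the paper's setting: the Gaussian data model, step-size scaling, noise level, and closed-form risk expression are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: signSGD on streaming noisy least squares in high dimension.
rng = np.random.default_rng(0)
d = 2000                                    # ambient dimension (illustrative)
w_star = rng.normal(size=d) / np.sqrt(d)    # ground-truth parameters
w = np.zeros(d)                             # initialization
lr = 1.0 / d                                # step size scaled with dimension (assumption)
noise_std = 0.5                             # label-noise level (assumption)

risks = []
for t in range(20000):
    x = rng.normal(size=d)                      # fresh Gaussian sample
    y = x @ w_star + noise_std * rng.normal()   # noisy label
    grad = (x @ w - y) * x                      # per-sample gradient of squared loss
    w -= lr * np.sign(grad)                     # signSGD: keep only the sign of each coordinate
    if t % 100 == 0:
        # Population risk for this toy model: 0.5*||w - w*||^2 + 0.5*noise_std^2
        risks.append(0.5 * np.dot(w - w_star, w - w_star) + 0.5 * noise_std**2)
```

In this high-dimensional regime the recorded risk curve concentrates around a deterministic trajectory, which is the kind of ODE/SDE description the paper derives.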
Primary Area: Optimization->Stochastic
Keywords: signSGD, stochastic optimization, deep learning theory, high-dimensional probability, stochastic differential equation
Submission Number: 7925