On the Interaction of Noise, Compression, and Adaptivity under $(L_0,L_1)$-Smoothness: An SDE Approach

Published: 09 Jun 2025, Last Modified: 01 Jul 2025
Venue: HiLD at ICML 2025 Poster
License: CC BY 4.0
Keywords: Stochastic Differential Equations, Adaptivity, Compression, Stochastic Optimization, Distributed Learning
TL;DR: We use SDEs to provide convergence bounds for DSGD, DCSGD, and DSignSGD under $(L_0,L_1)$-smoothness for various batch-noise structures.
Abstract: Using stochastic differential equation (SDE) approximations, we study the dynamics of Distributed SGD, Distributed Compressed SGD, and Distributed SignSGD under $(L_0,L_1)$-smoothness and flexible noise assumptions. Our analysis provides insights -- which we validate through simulation -- into the intricate interactions between batch noise, stochastic gradient compression, and adaptivity in this modern theoretical setting. For instance, we show that *adaptive* methods such as Distributed SignSGD converge under standard assumptions on the learning-rate scheduler, even under heavy-tailed noise. By contrast, Distributed (Compressed) SGD with a pre-scheduled decaying learning rate fails to converge unless the schedule also accounts for an inverse dependency on the gradient norm -- de facto turning it into an adaptive method.
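The following is a minimal 1-D sketch, not the paper's experiments, illustrating the contrast described in the abstract. Recall that $f$ is $(L_0,L_1)$-smooth when $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$; the toy objective $f(x) = \cosh(x)$ satisfies this with $L_0 = L_1 = 1$ but is not globally $L$-smooth. The Student-t batch noise, the schedule $\eta_t = \eta_0/\sqrt{t+1}$, and all constants below are illustrative assumptions.

```python
# Toy sketch (assumptions, not the paper's setup): on an (L_0, L_1)-smooth but not
# globally L-smooth objective, Distributed SGD with a pre-scheduled decaying step size
# can diverge from a far-from-optimum initialization, whereas Distributed SignSGD and a
# gradient-norm-normalized DSGD variant (the "adaptive fallback") make steady progress.
import numpy as np

rng = np.random.default_rng(0)

def grad(x):
    # f(x) = cosh(x): |f''(x)| = cosh(x) <= 1 + |sinh(x)| = 1 + |f'(x)|,
    # so f is (L_0, L_1)-smooth with L_0 = L_1 = 1, but not L-smooth.
    return np.sinh(x)

def worker_grads(x, n_workers):
    # Per-worker stochastic gradients with heavy-tailed (Student-t, 2 d.o.f.) batch noise.
    return grad(x) + rng.standard_t(df=2.0, size=n_workers)

def run(method, x0=10.0, steps=20_000, n_workers=8, eta0=0.05):
    x = np.float64(x0)
    with np.errstate(over="ignore", invalid="ignore"):
        for t in range(steps):
            g = worker_grads(x, n_workers)
            eta = eta0 / np.sqrt(t + 1.0)         # pre-scheduled decaying step size
            if method == "dsgd":                  # average raw stochastic gradients
                x -= eta * g.mean()
            elif method == "dsignsgd":            # average per-worker gradient signs
                x -= eta * np.sign(g).mean()
            elif method == "dsgd-normalized":     # step rescaled by 1 / (1 + |grad|)
                m = g.mean()
                x -= eta * m / (1.0 + np.abs(m))
        gap = np.abs(grad(x))
    return gap if np.isfinite(gap) else np.inf

for method in ("dsgd", "dsignsgd", "dsgd-normalized"):
    print(f"{method:>16s}: final |f'(x)| = {run(method):.3e}")
```

With these (assumed) constants, plain DSGD overshoots on the first steps because the step size does not shrink with the gradient norm and typically diverges, while the sign-based and normalized updates remain bounded per step and drive the gradient norm down.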
Student Paper: Yes
Submission Number: 39