On the Interaction of Batch Noise, Adaptivity, and Compression, under $(L_0,L_1)$-Smoothness: An SDE Approach

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, ICML 2026 regular, CC BY 4.0
TL;DR: We develop an SDE-based framework for DCSGD and DSignSGD, showing DCSGD needs noise- and compression-dependent normalization for stability, while DSignSGD remains robust and convergent even under heavy-tailed noise.
Abstract: Distributed stochastic optimization intertwines (i) stochastic gradient noise, (ii) communication compression, and (iii) adaptive/normalized updates. While each factor has been studied in isolation, their joint effect under realistic assumptions remains poorly understood. In this work, we develop a unified theoretical framework for Distributed Compressed SGD (DCSGD) and its sign variant Distributed SignSGD (DSignSGD) under the recently introduced $(L_0, L_1)$-smoothness condition. From a conceptual perspective, we show that the first- and second-order modified equations from the literature do not accurately model the discrete-time step-size/stability restrictions, especially under $(L_0, L_1)$-smoothness. From a technical perspective, we propose new first-order SDEs by carefully incorporating curvature-dependent terms into their drift, which helps capture the fine-grained relationship between learning rate restrictions, gradient noise, compression, and the geometry of the loss landscape. Importantly, we do so under general gradient noise assumptions, including heavy-tailed and affine-variance regimes, which extend beyond the classical bounded-variance setting. Our results suggest that normalizing the updates of DCSGD emerges as a natural condition for stability, with the degree of normalization precisely determined by the gradient noise structure, the landscape’s regularity, and the compression rate. In contrast, DSignSGD converges even under heavy-tailed noise with standard learning rate schedules. Together, these findings offer new theoretical insights and perspectives as well as practical guidance.
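For reference, the $(L_0, L_1)$-smoothness condition named in the abstract is commonly stated, for a twice-differentiable loss $f$, as the bound below. The paper's exact formulation is not given on this page and may differ (for example, a first-order variant that avoids the Hessian), so this is a sketch of the standard definition rather than a quote from the paper.

$$\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\| \quad \text{for all } x.$$

Setting $L_1 = 0$ recovers classical $L_0$-smoothness. Heuristically, the curvature bound grows with the gradient norm, so a stable step size should scale like $1/(L_0 + L_1 \|\nabla f(x)\|)$, which is one way to see why normalizing by the gradient magnitude can arise as a stability condition.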
Lay Summary: Training large machine learning models often uses many computers at once. Each computer estimates the learning signal from a small batch of data, then sends a shortened version of that signal to save communication, and modern optimizers may also normalize these updates. These three choices (batch noise, compression, and normalization) can interact in ways that make training unstable, but most theory studies them separately or with assumptions that are too simple for modern neural-network losses. This paper builds new continuous-time mathematical models of compressed distributed SGD and distributed SignSGD that are designed to preserve the same stability limits as the original step-by-step algorithms. The main insight is that a model must account for how the curvature of the loss landscape changes with the gradient size; otherwise it can wrongly predict convergence when the actual optimizer would diverge. The analysis shows that compressed distributed SGD needs a specific amount of normalization, determined by the noise level, the compression strength, and the geometry of the loss. It also shows that distributed SignSGD is naturally robust to very large, heavy-tailed gradient errors because taking signs already normalizes the update. These results give practical guidance for choosing learning-rate schedules and normalization in noisy, compressed, distributed training.
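The two update rules described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the paper's implementation: the compressor `rand_k` (unbiased random sparsification) is one common choice and is assumed here, the helper names `dcsgd_step` and `dsignsgd_step` are hypothetical, and the server is assumed to simply average what the workers send.

```python
import numpy as np

def rand_k(g, k, rng):
    """Assumed compressor: keep k random coordinates and rescale by d/k,
    so that the compressed vector equals g in expectation."""
    d = g.size
    kept = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(g, dtype=float)
    out[kept] = g[kept] * (d / k)
    return out

def dcsgd_step(x, worker_grads, lr, k, rng):
    """One compressed-SGD step: each worker compresses its stochastic
    gradient, and the averaged compressed gradients drive the update."""
    compressed = [rand_k(g, k, rng) for g in worker_grads]
    return x - lr * np.mean(compressed, axis=0)

def dsignsgd_step(x, worker_grads, lr):
    """One sign-SGD step: each worker sends only the coordinate-wise sign
    of its stochastic gradient, and the averaged signs drive the update."""
    signs = [np.sign(g) for g in worker_grads]
    return x - lr * np.mean(signs, axis=0)
```

Each coordinate of the averaged signs lies in $[-1, 1]$, so a `dsignsgd_step` moves every coordinate by at most `lr` regardless of how large the gradients are, which illustrates the sense in which taking signs already normalizes the update. The `dcsgd_step` update, by contrast, scales with the gradient magnitude and is further amplified by the factor `d / k` on the kept coordinates.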
Primary Area: Optimization->Stochastic
Keywords: Stochastic Differential Equations, $(L_0, L_1)$-Smoothness, Distributed Learning, Adaptivity
Submission Number: 797