On the Interaction of Batch Noise, Adaptivity, and Compression, under $(L_0,L_1)$-Smoothness: An SDE Approach

ICLR 2026 Conference Submission 24942 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Stochastic Differential Equations, $(L_0, L_1)$-Smoothness, Distributed Learning, Adaptivity
TL;DR: We develop an SDE-based framework for DCSGD and DSignSGD, showing DCSGD needs noise- and compression-dependent normalization for stability, while DSignSGD remains robust and convergent even under heavy-tailed noise.
Abstract: Understanding the dynamics of distributed stochastic optimization requires accounting for several major factors that affect convergence, such as gradient noise, communication compression, and the use of adaptive update rules. While each factor has been studied in isolation, their joint effect under realistic assumptions remains poorly understood. In this work, we develop a unified theoretical framework for Distributed Compressed SGD (DCSGD) and its sign-based variant, Distributed SignSGD (DSignSGD), under the recently introduced $(L_0, L_1)$-smoothness condition. Our analysis leverages stochastic differential equations (SDEs): we show that while standard first-order SDE approximations can lead to misleading conclusions, including higher-order terms captures the fine-grained interaction among learning rates, gradient noise, compression, and the geometry of the loss landscape. These tools allow us to study the dynamics under general gradient-noise assumptions, including heavy-tailed and affine-variance regimes that extend beyond the classical bounded-variance setting. Our results show that normalizing the updates of DCSGD emerges as a natural condition for stability, with the degree of normalization precisely determined by the gradient-noise structure, the landscape's regularity, and the compression rate. In contrast, our model predicts that DSignSGD converges even under heavy-tailed noise with standard learning rate schedules, a finding that we verify empirically. Together, these findings offer both new theoretical insights and practical guidance for designing stable and robust distributed learning algorithms.
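For reference (not part of the submission itself), the objects named in the abstract are typically stated as below. This is a minimal sketch assuming the standard formulations with $n$ workers, per-worker stochastic gradients $g_i$, a compression operator $\mathcal{C}$, and learning rate $\eta$; the paper's precise definitions may differ.

$$\|\nabla^2 f(x)\| \;\le\; L_0 + L_1\,\|\nabla f(x)\| \qquad \big((L_0,L_1)\text{-smoothness}\big)$$

$$x_{k+1} \;=\; x_k \;-\; \eta\,\frac{1}{n}\sum_{i=1}^{n} \mathcal{C}\big(g_i(x_k)\big) \qquad \text{(DCSGD)}$$

$$x_{k+1} \;=\; x_k \;-\; \eta\,\frac{1}{n}\sum_{i=1}^{n} \operatorname{sign}\big(g_i(x_k)\big) \qquad \text{(DSignSGD)}$$

Under $(L_0,L_1)$-smoothness the curvature bound grows with the gradient norm, which is consistent with the abstract's claim that noise- and compression-dependent normalization of the DCSGD updates is needed for stability, whereas the sign operation in DSignSGD already bounds the update magnitude.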
Primary Area: optimization
Submission Number: 24942