Sketched Adaptive Distributed Deep Learning: A Sharp Convergence Analysis

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Optimization, Distributed Learning, Communication Efficiency
TL;DR: We provide a sharp convergence analysis of sketched adaptive distributed deep learning that depends only on the intrinsic dimension of the loss Hessian, rather than on the full dimensionality.
Abstract: Combining gradient compression with adaptive optimizers is a highly desirable goal in distributed learning, with potential benefits in both fewer communication rounds and less per-round communication. Despite preliminary empirical promise, several major challenges in the convergence analysis of such methods have remained open: handling compression-based approximation of both the first and second moments (pre-conditioner), which appear as a ratio; avoiding dependence on the number of parameters, which is extremely large in modern deep models; and providing high-probability guarantees instead of in-expectation ones, which can hide high-variance behavior. In this work, we introduce a family of Sketched Adaptive Distributed Learning (SADL) algorithms that combine suitable unbiased gradient sketching for compression with suitable adaptive optimization algorithms. As our main contribution, we provide theoretical convergence guarantees for SADL algorithms that address all of these challenges. In particular, our guarantees hold with high probability, pick up only a logarithmic dependence on the number of parameters, and handle the first- and second-moment approximation precisely, yielding a dependence on the intrinsic dimension of the loss Hessian, which is significantly smaller than the full dimensionality of deep learning models. Empirically, the SADL algorithms are competitive with and often outperform baselines on both vision and language tasks, in both supervised fine-tuning and training-from-scratch regimes. Further, the SADL algorithms are also competitive with state-of-the-art communication-efficient distributed learning algorithms based on error feedback.
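The abstract does not spell out the algorithm itself, but the general idea it describes, workers compressing gradients with a shared unbiased sketch and the server applying an adaptive (Adam-style) update to the de-sketched average, can be illustrated with a minimal toy sketch. Everything below (the Gaussian sketching matrix, the sketch size `k`, the least-squares objective, and all hyperparameters) is an illustrative assumption, not the paper's actual SADL specification.

```python
# Minimal, hypothetical sketch of sketched adaptive distributed learning:
# each worker sends a k-dimensional sketch of its d-dimensional gradient,
# the server averages, de-sketches, and takes an Adam-style step.
import numpy as np

d, k = 10_000, 256                                  # model dim and sketch size (assumed values)
rng = np.random.default_rng(0)
S = rng.standard_normal((k, d)) / np.sqrt(k)        # shared sketching matrix, E[S^T S] = I

def sketch(g):                                      # worker-side compression: k floats instead of d
    return S @ g

def desketch(sg):                                   # server-side unbiased reconstruction
    return S.T @ sg

def adam_step(x, g, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * g                       # first moment of the de-sketched gradient
    v = b2 * v + (1 - b2) * g * g                   # second moment (pre-conditioner)
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return x - lr * m_hat / (np.sqrt(v_hat) + eps), (m, v, t)

# Toy least-squares objective split across 8 workers (purely illustrative data).
A = rng.standard_normal((8, 100, d))
b = rng.standard_normal((8, 100))
x = np.zeros(d)
state = (np.zeros(d), np.zeros(d), 0)
for step in range(50):
    grads = [Ai.T @ (Ai @ x - bi) / len(bi) for Ai, bi in zip(A, b)]  # local gradients
    avg_sketch = np.mean([sketch(g) for g in grads], axis=0)          # each worker communicates k floats
    x, state = adam_step(x, desketch(avg_sketch), state)
```

In this toy setup the per-round communication drops from d to k numbers per worker; the paper's analysis concerns how such compression interacts with the adaptive first- and second-moment estimates.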
Supplementary Material: zip
Primary Area: Optimization (e.g., convex and non-convex, stochastic, robust)
Submission Number: 23560