Ringmaster ASGD: The First Asynchronous SGD with Optimal Time Complexity

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Ringmaster ASGD is the first optimal Asynchronous SGD method—it achieves optimality by ignoring gradients with delays larger than a certain threshold.
Abstract: Asynchronous Stochastic Gradient Descent (Asynchronous SGD) is a cornerstone method for parallelizing learning in distributed machine learning. However, its performance suffers under arbitrarily heterogeneous computation times across workers, leading to suboptimal time complexity and inefficiency as the number of workers scales. While several Asynchronous SGD variants have been proposed, recent findings by Tyurin & Richtárik (NeurIPS 2023) reveal that none achieve optimal time complexity, leaving a significant gap in the literature. In this paper, we propose Ringmaster ASGD, a novel Asynchronous SGD method designed to address these limitations and tame the inherent challenges of Asynchronous SGD. We establish, through rigorous theoretical analysis, that Ringmaster ASGD achieves optimal time complexity under arbitrarily heterogeneous and dynamically fluctuating worker computation times. This makes it the first Asynchronous SGD method to meet the theoretical lower bounds for time complexity in such scenarios.
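To make the delay-threshold rule from the TL;DR concrete, here is a minimal simulation sketch of that idea. It assumes a toy quadratic objective, a placeholder step size and threshold, and a random-order model of which worker finishes next; none of these choices come from the paper, and the sketch is not the authors' actual algorithm specification or analysis.

```python
# Sketch of a delay-thresholded asynchronous SGD loop: the server applies a
# worker's gradient only if its delay (staleness) is at most a threshold,
# and discards it otherwise. All constants below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

def stoch_grad(x):
    # Toy stochastic gradient of f(x) = 0.5 * ||x||^2 with additive noise.
    return x + 0.1 * rng.standard_normal(x.shape)

dim = 10
x = rng.standard_normal(dim)   # server model
lr = 0.1                       # step size (placeholder)
delay_threshold = 4            # maximum tolerated delay (placeholder)
num_workers = 8
num_updates = 200

# Iteration index at which each worker last read the server model,
# and the model copy it is currently computing a gradient on.
read_iteration = np.zeros(num_workers, dtype=int)
local_copies = [x.copy() for _ in range(num_workers)]

iteration = 0
while iteration < num_updates:
    # A random worker finishes next (stand-in for heterogeneous compute times).
    w = rng.integers(num_workers)
    delay = iteration - read_iteration[w]
    g = stoch_grad(local_copies[w])

    if delay <= delay_threshold:
        # Accept: apply the (possibly stale) gradient and advance the iterate.
        x = x - lr * g
        iteration += 1
    # else: the gradient is too stale and is simply discarded.

    # In either case the worker re-reads the current model and starts over.
    read_iteration[w] = iteration
    local_copies[w] = x.copy()

print("final ||x|| =", np.linalg.norm(x))
```

The point of the sketch is only the accept/discard branch: stale gradients beyond the threshold never enter the update, which is the mechanism the TL;DR credits for the method's optimal time complexity.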
Lay Summary: When training machine learning models across many computers, it’s common for some to work slower than others. This imbalance can make the whole process inefficient — like trying to row a boat with paddlers moving at different speeds. A popular method called Asynchronous SGD tries to handle this by letting faster machines move ahead without waiting. But recent research showed that no existing version of this method makes the best use of time when machines vary in speed, especially at large scale. We introduce a new method, Ringmaster ASGD, that coordinates these machines more effectively. Even when their speeds change unpredictably, our method keeps training efficient. We prove that it achieves the best possible speed in such scenarios. This could help a wide range of users — from companies training large language models to researchers working on scientific simulations — to build machine learning systems that are faster, more scalable, and better suited to real-world computing environments.
Primary Area: Optimization->Large Scale, Parallel and Distributed
Keywords: Asynchronous SGD, optimal time complexity, nonconvex optimization, parallel methods, stochastic optimization
Submission Number: 2280