Keywords: distributed optimization, quantization, asynchrony, decentralized optimization, non-blocking algorithms
TL;DR: We describe the first variant of decentralized SGD which combines asynchrony, non-blocking communication, local updates, and quantization, prove its convergence, and show that it provides state-of-the-art scalability in some settings.
Abstract: Decentralized optimization is emerging as a viable alternative for scalable distributed machine learning,
but also introduces new challenges in terms of synchronization costs.
To address these synchronization costs, several communication-reduction techniques, such as non-blocking communication, quantization, and local steps,
have been explored in the decentralized setting.
Due to the complexity of analyzing optimization in such a relaxed setting,
this line of work often assumes \emph{global} communication rounds, which require additional synchronization.
In this paper, we consider decentralized optimization in the simpler, but harder to analyze, \emph{asynchronous gossip} model,
in which communication occurs in discrete, randomly chosen pairings among nodes.
Perhaps surprisingly, we show that a variant of SGD called \emph{SwarmSGD} still converges in this setting,
even if \emph{non-blocking communication}, \emph{quantization}, and \emph{local steps} are all applied \emph{in conjunction}, and even if the node data distributions and underlying graph topology are both \emph{heterogeneous}.
Our analysis is based on a new connection with multi-dimensional load-balancing processes.
We implement this algorithm and deploy it in a supercomputing environment, showing that it can outperform previous decentralized methods in terms of end-to-end training time, and that it can even rival carefully tuned large-batch SGD for certain tasks.
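For intuition, here is a minimal sketch of the kind of asynchronous pairwise-gossip update the abstract describes: nodes take a few local SGD steps, then a randomly chosen pair exchanges quantized models and averages them. The quantizer, the toy objective, the number of local steps, and the sequential simulation of pairings are all illustrative assumptions, not the paper's exact SwarmSGD implementation (in particular, the non-blocking aspect is not modeled here).

```python
# Sketch of asynchronous pairwise gossip with local steps and quantization.
# All helper names and constants below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, levels=256):
    """Stochastic uniform quantization (placeholder, not the paper's scheme)."""
    scale = np.max(np.abs(x)) + 1e-12
    y = x / scale * (levels // 2)
    low = np.floor(y)
    y = low + (rng.random(y.shape) < (y - low))  # stochastic rounding
    return y * scale / (levels // 2)

def local_sgd(model, data, lr=0.01, steps=4):
    """A few local SGD steps on a toy least-squares objective."""
    X, y = data
    for _ in range(steps):
        grad = X.T @ (X @ model - y) / len(y)
        model = model - lr * grad
    return model

# Toy heterogeneous data: each node has its own regression problem.
dim, n_nodes = 10, 8
models = [rng.normal(size=dim) for _ in range(n_nodes)]
data = [(rng.normal(size=(32, dim)), rng.normal(size=32)) for _ in range(n_nodes)]

for _ in range(1000):
    # Asynchronous gossip: a uniformly random pair of nodes interacts.
    i, j = rng.choice(n_nodes, size=2, replace=False)
    # Each node applies its pending local steps, then the pair exchanges
    # quantized models and averages.
    mi = local_sgd(models[i], data[i])
    mj = local_sgd(models[j], data[j])
    qi, qj = quantize(mi), quantize(mj)
    models[i] = (mi + qj) / 2
    models[j] = (mj + qi) / 2
```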
Supplementary Material: pdf
Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.
Code: zip