Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction
Keywords: nonconvex optimization, lower bounds, distributed optimization
Abstract: We consider centralized distributed optimization in the classical federated learning setup, where $n$ workers jointly find an $\varepsilon$-stationary point of an $L$-smooth, $d$-dimensional nonconvex function $f$, having access only to unbiased stochastic gradients with variance $\sigma^2$. Each worker requires at most $h$ seconds to compute a stochastic gradient, and the communication times from the server to the workers and from the workers to the server are $\tau_{\textnormal{s}}$ and $\tau_{\textnormal{w}}$ seconds per coordinate, respectively. One of the main motivations for distributed optimization is to achieve scalability with respect to $n$. For instance, it is well known that the distributed version of \algname{SGD} has a variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{n \varepsilon^2},$ which improves with the number of workers $n,$ where $\Delta := f(x^0) - f^*$ and $x^0 \in \mathbb{R}^d$ is the starting point. Similarly, using unbiased sparsification compressors, it is possible to reduce \emph{both} the variance-dependent runtime term and the communication runtime term from $\tau_{\textnormal{w}} d \frac{L \Delta}{\varepsilon}$ to $\frac{\tau_{\textnormal{w}} d L \Delta}{n \varepsilon} + \sqrt{\frac{\tau_{\textnormal{w}} d h \sigma^2}{n \varepsilon}} \cdot \frac{L \Delta}{\varepsilon},$ which also benefits from increasing $n.$ However, once we account for the server-to-worker communication time $\tau_{\textnormal{s}},$ we prove that it is infeasible to design a method based on unbiased random sparsification compressors that improves both the server-side communication runtime term $\tau_{\textnormal{s}} d \frac{L \Delta}{\varepsilon}$ and the variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{\varepsilon^2}$ by more than a poly-logarithmic factor in $n$, even in the homogeneous (i.i.d.) case, where all workers access the same function or data distribution. Indeed, when $\tau_{\textnormal{s}} \simeq \tau_{\textnormal{w}},$ our lower bound is $\tilde{\Omega}(\min[h (\frac{\sigma^2}{n \varepsilon} + 1) \frac{L \Delta}{\varepsilon} + {\tau_{\textnormal{s}} d \frac{L \Delta}{\varepsilon}},\; h \frac{L \Delta}{\varepsilon} + {h \frac{\sigma^2 L \Delta}{\varepsilon^2}}]).$ To establish this result, we construct a new ``worst-case'' function and develop a new lower bound framework that reduces the analysis to the concentration of a certain random sum, for which we prove a concentration bound. These results reveal fundamental limitations in scaling distributed optimization, even under the homogeneous (i.i.d.) assumption.
Primary Area: optimization
Submission Number: 4737
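For context, the following is a minimal sketch (not taken from the submission) of the kind of unbiased random sparsification compressor the abstract refers to: a RandK-style operator that keeps $k$ of the $d$ coordinates chosen uniformly at random and rescales them by $d/k$, so the output is an unbiased estimate of the input while only $k$ coordinates need to be communicated. The function name, parameters, and the empirical check below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def rand_k(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Unbiased random sparsification (RandK-style sketch).

    Keeps k coordinates of x chosen uniformly at random and rescales them
    by d/k, so that E[rand_k(x)] = x while only k of the d coordinates
    would need to be sent over the network.
    """
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)  # k distinct coordinates
    out[idx] = (d / k) * x[idx]                 # rescale to keep unbiasedness
    return out

# Illustrative check of unbiasedness: average many compressed copies of x.
rng = np.random.default_rng(0)
x = rng.standard_normal(10)
avg = np.mean([rand_k(x, k=3, rng=rng) for _ in range(200_000)], axis=0)
print(np.max(np.abs(avg - x)))  # small, since E[rand_k(x)] = x
```

The $d/k$ rescaling preserves unbiasedness but inflates the second moment by a factor of $d/k$; this variance-versus-communication trade-off is what underlies the compressed-communication runtime terms discussed in the abstract.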