Abstract: Hierarchical SGD (H-SGD) has emerged as a new distributed SGD algorithm for multi-level communication networks. In H-SGD, before each global aggregation, workers send their updated local models to local servers for aggregation. Despite recent research efforts, the effect of local aggregation on global convergence still lacks theoretical understanding. In this work, we first introduce a new notion of "upward" and "downward" divergences. We then use it to conduct a novel analysis and obtain a worst-case convergence upper bound for two-level H-SGD with non-IID data, a non-convex objective function, and stochastic gradients. By extending this result to the case with random grouping, we observe that the convergence upper bound of H-SGD lies between the upper bounds of two single-level local SGD settings, with the number of local iterations equal to the local and global update periods of H-SGD, respectively. We refer to this as the "sandwich behavior". Furthermore, we extend our analytical approach based on "upward" and "downward" divergences to study convergence in the general case of H-SGD with more than two levels, where the "sandwich behavior" still holds. Our theoretical results provide key insights into why local aggregation can be beneficial in improving the convergence of H-SGD.
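To make the two-level H-SGD procedure described above concrete, the following is a minimal sketch of workers performing local SGD steps, local servers averaging within their groups every local update period, and a global server averaging across all workers every global update period. The function names (`hierarchical_sgd`, `sgd_step`) and parameters (`tau_l` and `tau_g` for the local and global update periods, `groups` for the worker-to-local-server assignment) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sgd_step(w, grad_fn, lr):
    """One stochastic gradient step on a worker's (possibly non-IID) data."""
    return w - lr * grad_fn(w)

def hierarchical_sgd(w0, worker_grads, groups, tau_l=5, tau_g=20,
                     lr=0.1, rounds=10):
    """Two-level H-SGD sketch: workers -> local servers -> global server.

    worker_grads: one stochastic-gradient function per worker.
    groups: list of worker-index lists, one list per local server.
    tau_g is assumed to be a multiple of tau_l.
    """
    w_global = np.asarray(w0, dtype=float)
    for _ in range(rounds):
        # Each worker starts the round from the current global model.
        w = [w_global.copy() for _ in worker_grads]
        for t in range(tau_g):
            # Local SGD step on every worker.
            w = [sgd_step(wi, gi, lr) for wi, gi in zip(w, worker_grads)]
            # Local aggregation at each local server every tau_l iterations.
            if (t + 1) % tau_l == 0:
                for g in groups:
                    avg = np.mean([w[i] for i in g], axis=0)
                    for i in g:
                        w[i] = avg.copy()
        # Global aggregation at the end of the global update period.
        w_global = np.mean(w, axis=0)
    return w_global

# Toy usage: 4 workers minimizing quadratics with different optima (non-IID),
# grouped under two local servers.
targets = [np.array([1.0]), np.array([2.0]), np.array([-1.0]), np.array([0.0])]
grads = [lambda w, t=t: 2.0 * (w - t) for t in targets]
w_final = hierarchical_sgd(np.zeros(1), grads, groups=[[0, 1], [2, 3]])
```

In this sketch, setting `tau_l = tau_g` recovers single-level local SGD with period `tau_g` (no intermediate aggregation), and `tau_l = 1` approaches the other single-level extreme, matching the two baselines between which the paper's "sandwich" bounds are stated.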