Keywords: Distributed deep learning; gradient prediction; asynchronous SGD; convergence; time delay
Abstract: In this paper, we propose a new algorithm, termed Predicting Clipping Asynchronous Stochastic Gradient Descent (PC-ASGD), to address the issues of staleness and time delay in asynchronous distributed learning settings. Specifically, PC-ASGD consists of two steps: the predicting step uses Taylor-expansion-based gradient prediction to reduce the staleness of the outdated weights, while
the clipping step selectively drops the outdated weights to alleviate their negative effects. A tradeoff parameter is introduced to balance the contributions of these two steps. We theoretically establish the convergence rate of the proposed algorithm with a constant step size for smooth nonconvex objective functions, accounting for the effects of delay. For empirical validation, we demonstrate the performance of the algorithm with two deep neural network architectures on two benchmark datasets.
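As an illustration of how the two steps described in the abstract might be combined, the sketch below blends a Taylor-expansion-based predicted gradient with a gradient computed at the fresh weights, weighted by a tradeoff parameter. The function names (`grad_fn`, `hess_vec_fn`), the use of a Hessian-vector product for the first-order correction, and the specific blending form are assumptions made for this sketch, not the paper's exact formulation.

```python
import numpy as np

def pc_asgd_style_update(w_fresh, w_stale, grad_fn, hess_vec_fn, lr, theta):
    """Hypothetical sketch of one predicting-clipping style update.

    w_fresh     : current (up-to-date) weights
    w_stale     : outdated weights from a delayed worker
    grad_fn     : returns a stochastic gradient at the given weights
    hess_vec_fn : returns a Hessian-vector product at the given weights
                  (assumed here for the Taylor correction)
    lr          : constant step size
    theta       : tradeoff parameter in [0, 1] balancing the two steps
    """
    # Predicting step (assumed form): approximate the gradient at the fresh
    # weights via a first-order Taylor expansion around the stale weights:
    #   g(w_fresh) ~ g(w_stale) + H(w_stale) (w_fresh - w_stale)
    g_stale = grad_fn(w_stale)
    predicted_grad = g_stale + hess_vec_fn(w_stale, w_fresh - w_stale)

    # Clipping step (assumed form): drop the stale contribution entirely and
    # rely only on the gradient computed at the fresh weights.
    clipped_grad = grad_fn(w_fresh)

    # Tradeoff parameter blends the two candidate updates.
    update = theta * predicted_grad + (1.0 - theta) * clipped_grad
    return w_fresh - lr * update
```

In this reading, theta = 1 recovers a purely prediction-based correction of the stale gradient, while theta = 0 discards the delayed information altogether; intermediate values trade off the two effects, consistent with the role of the tradeoff parameter described in the abstract.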
Submission Number: 25