Abstract: Stochastic Gradient Descent (SGD) can be effectively parallelized across many workers via the use of minibatches. Yet synchronous parameter updates require every step to wait for the slowest worker to finish. A fully asynchronous alternative, Downpour SGD (ASGD), detailed by Dean et al. (2012), minimizes worker idle time by allowing gradients computed on stale parameters to be sent to the parameter server. In practice, direct use of ASGD is not recommended because of the noise added by stale gradients (the "delayed gradient problem"), so some form of delay compensation is required, as detailed by Zheng et al. (2017). In this paper, we present a detailed analysis of the failure modes of asynchronous SGD caused by delayed gradients under various hyperparameter selections, in order to better inform where ASGD is best applied. On the MNIST digit recognition task with the LeNet5 model, we find that delayed gradients significantly reduce test accuracy at large batch sizes and large learning rates. This limits the applicability of asynchronous gradient methods (without delay compensation) in cases where the learning rate is scaled linearly with the batch size, or with adaptive methods that may at times select large learning rates. Finally, we discuss possible delay compensation methods.
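To make the delayed gradient problem and its compensation concrete, the following is a minimal sketch of the two update rules discussed above. The notation (staleness $\tau$, learning rate $\eta$, compensation strength $\lambda$) is ours and not taken verbatim from the cited papers; the compensated update follows the diagonal outer-product Hessian approximation proposed by Zheng et al. (2017).

% Delayed-gradient update: at step t the parameter server applies a gradient
% that a worker computed on a parameter snapshot from step t - tau:
\[
    w_{t+1} = w_t - \eta \, \nabla f\!\left(w_{t-\tau}\right)
\]
% Delay compensation corrects the stale gradient toward the value it would
% take at the current parameters, approximating the Hessian by the diagonal
% of the gradient outer product, scaled by a hyperparameter lambda:
\[
    \tilde{g}_t = \nabla f\!\left(w_{t-\tau}\right)
        + \lambda \, \nabla f\!\left(w_{t-\tau}\right) \odot \nabla f\!\left(w_{t-\tau}\right) \odot \left(w_t - w_{t-\tau}\right)
\]

When $\tau = 0$ (no staleness) the first rule reduces to standard SGD, and the correction term in the second rule vanishes.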
Keywords: asynchronous, SGD, gradient descent, parallel training, workers, reduction, delayed gradient, convolutional neural network
TL;DR: A causal analysis of the test accuracy reduction when using asynchronous SGD, and a recommendation of hyperparameter bounds within which delayed gradients can be safely used.