Abstract: Decentralized adaptive gradient methods, in which each node averages only with its neighbors, are critical for reducing communication overhead and wall-clock training time in deep learning tasks. While they differ in their concrete recursions, existing decentralized adaptive methods share the same algorithmic structure: each node scales its gradient with information from past squared gradients (referred to as the adaptive step) before or while it communicates with its neighbors. In this paper, we identify the limitation of this adapt-then/while-communicate structure: it makes the resulting algorithms highly sensitive to heterogeneous data distributions and hence causes their limiting points to deviate from the stationary solution. To overcome this limitation, we propose an effective decentralized adaptive method with a communicate-then-adapt structure, in which each node performs the adaptive step after finishing its neighborhood communications. The new method is theoretically guaranteed to converge to a stationary solution in the non-convex setting. Experimental results on a variety of CV/NLP tasks show that our method clearly outperforms existing decentralized adaptive methods.
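To make the structural contrast concrete, below is a minimal sketch of one local update under each structure, assuming an Adam-style second-moment estimate and simple gossip averaging over neighbors; the names (`mixing_weights`, `neighbors_x`, `local_grad`) and hyperparameters are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def adapt_then_communicate_step(x, v, neighbors_x, local_grad, mixing_weights,
                                lr=1e-3, beta2=0.999, eps=1e-8):
    """Existing structure: apply the adaptive step locally, then gossip-average."""
    v = beta2 * v + (1 - beta2) * local_grad ** 2              # adaptive (second-moment) step
    x = x - lr * local_grad / (np.sqrt(v) + eps)               # locally scaled update
    x = sum(w * xj for w, xj in zip(mixing_weights, neighbors_x + [x]))  # then average with neighbors
    return x, v

def communicate_then_adapt_step(x, v, neighbors_x, local_grad, mixing_weights,
                                lr=1e-3, beta2=0.999, eps=1e-8):
    """Proposed structure: gossip-average first, then apply the adaptive step."""
    x = sum(w * xj for w, xj in zip(mixing_weights, neighbors_x + [x]))  # average with neighbors first
    v = beta2 * v + (1 - beta2) * local_grad ** 2              # adaptive step after communication
    x = x - lr * local_grad / (np.sqrt(v) + eps)               # scaled update on the averaged model
    return x, v
```

In this hypothetical sketch, the only difference is whether the neighborhood averaging happens before or after the adaptive scaling, which is the structural distinction the abstract describes.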