Peering Beyond the Gradient Veil with Efficient Distributed Auto Differentiation
- Keywords: deep learning, distributed, distributed deep learning, auto differentiation, distributed learning, federated learning
- Abstract: Although distributed machine learning has opened up numerous frontiers of research, the fragmentation of models and data across devices, nodes, and sites still incurs significant communication overhead, making reliable training difficult. The focus on gradients as the primary shared statistic during training has led to a number of intuitive algorithms for distributed deep learning; however, gradient-based algorithms for training large deep neural networks (DNNs) are communication-heavy and often require additional modifications, such as sparsity constraints, compression, or quantization, to reduce bandwidth. We introduce a novel communication-friendly approach for training distributed DNNs that exploits the outer-product structure of the gradient. Examining the form of the computation produced by automatic differentiation leads to a new class of distributed learning algorithms with the benefits, but not the constraints, of sharing full gradients. This approach invites a novel compression algorithm based on structured power iterations, which not only reduces bandwidth but also enables introspection into distributed training dynamics without significant performance loss. We show that this method outperforms other state-of-the-art distributed methods on a number of large-scale text and imaging datasets while consuming less bandwidth.
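The outer-product structure the abstract refers to can be illustrated on a single dense layer: for `y = x @ W`, the weight gradient is `G = x.T @ delta`, a sum of per-example outer products of activations `x` and backpropagated errors `delta`. A minimal NumPy sketch (toy shapes and iteration count are illustrative assumptions, not the paper's actual algorithm) of how such a factored gradient can be compressed by power iteration without ever materializing the dense gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dense layer: y = x @ W, with a batch of activations x and
# backpropagated errors delta = dL/dy (both stand-ins here).
batch, n_in, n_out = 32, 64, 16
x = rng.standard_normal((batch, n_in))
delta = rng.standard_normal((batch, n_out))

# The weight gradient is a sum of per-example outer products:
# G = x.T @ delta. Sharing the factors (x, delta) costs
# O(batch * (n_in + n_out)) instead of O(n_in * n_out) for G itself.
G = x.T @ delta

# Rank-1 compression by power iteration, using only the factors
# (G is never formed inside the loop):
u = rng.standard_normal(n_in)
for _ in range(200):
    v = delta.T @ (x @ u)      # v = G.T @ u, via the factors
    v /= np.linalg.norm(v)
    u = x.T @ (delta @ v)      # u = G @ v, via the factors
    sigma = np.linalg.norm(u)  # converges to the top singular value
    u /= sigma

G1 = sigma * np.outer(u, v)    # rank-1 approximation of G

# Compare against the top singular value of the dense gradient.
top_sv = np.linalg.svd(G, compute_uv=False)[0]
print(f"estimated sigma = {sigma:.4f}, true top sv = {top_sv:.4f}")
```

Each matrix-vector product in the loop touches only the shared factors `x` and `delta`, which is what makes this kind of structured power iteration bandwidth-friendly: nodes can exchange low-dimensional vectors rather than full gradient matrices.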