Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks
Abstract: The past few years have witnessed growth in the
computational requirements for training deep convolutional neural networks. Current approaches
parallelize training onto multiple devices by applying a single parallelization strategy (e.g., data
or model parallelism) to all layers in a network.
Although easy to reason about, these approaches
result in suboptimal runtime performance in large-scale distributed training, since different layers
in a network may prefer different parallelization
strategies. In this paper, we propose layer-wise
parallelism that allows each layer in a network
to use an individual parallelization strategy. We
jointly optimize how each layer is parallelized by
solving a graph search problem. Our evaluation
shows that layer-wise parallelism outperforms
state-of-the-art approaches by increasing training throughput, reducing communication costs,
and achieving better scalability to multiple GPUs,
all while maintaining the original network accuracy.
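To make the idea of layer-wise parallelism concrete, below is a minimal sketch of jointly choosing one parallelization strategy per layer. It is not the paper's implementation: it assumes a hypothetical chain-structured network, user-supplied `compute_cost` and `comm_cost` estimates, and a simple dynamic program as the "graph search"; the paper's method handles general computation graphs and a richer cost model.

```python
# A minimal sketch (illustrative only): pick one parallelization strategy per layer
# of a chain-structured network so that estimated compute cost plus layout-conversion
# (communication) cost between consecutive layers is minimized.

def choose_layer_strategies(layers, strategies, compute_cost, comm_cost):
    """layers: list of layer names, in forward order.
    strategies: candidate parallelization strategies (e.g. 'data', 'model').
    compute_cost(layer, s): estimated time to run `layer` under strategy `s`.
    comm_cost(s_prev, s_next): estimated cost of converting tensor layouts
        between consecutive layers using strategies s_prev and s_next.
    Returns (total_cost, per-layer strategy assignment)."""
    # best[s] = (cost of the best assignment for the prefix ending with strategy s, path)
    best = {s: (compute_cost(layers[0], s), [s]) for s in strategies}
    for layer in layers[1:]:
        new_best = {}
        for s in strategies:
            new_best[s] = min(
                (prev_cost + comm_cost(prev_s, s) + compute_cost(layer, s),
                 prev_path + [s])
                for prev_s, (prev_cost, prev_path) in best.items()
            )
        best = new_best
    return min(best.values())


if __name__ == "__main__":
    # Hypothetical toy costs: convolutions are cheaper under data parallelism,
    # fully-connected layers under model parallelism; switching strategies
    # between layers incurs a small communication cost.
    layers = ["conv1", "conv2", "fc1", "fc2"]
    strategies = ["data", "model"]
    compute = {("conv1", "data"): 1.0, ("conv1", "model"): 3.0,
               ("conv2", "data"): 1.0, ("conv2", "model"): 3.0,
               ("fc1", "data"): 4.0, ("fc1", "model"): 1.5,
               ("fc2", "data"): 4.0, ("fc2", "model"): 1.5}
    total, assignment = choose_layer_strategies(
        layers, strategies,
        compute_cost=lambda l, s: compute[(l, s)],
        comm_cost=lambda a, b: 0.0 if a == b else 0.5)
    print(total, dict(zip(layers, assignment)))
```

Under these toy costs the sketch assigns data parallelism to the convolutional layers and model parallelism to the fully-connected layers, illustrating why a single global strategy can be suboptimal.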