Beyond Data and Model Parallelism for Deep Neural Networks
Abstract: Existing deep learning systems commonly parallelize deep neural network (DNN) training using data or model
parallelism, but these strategies often result in suboptimal parallelization performance. We introduce SOAP, a
more comprehensive search space of parallelization strategies for DNNs that includes strategies to parallelize a
DNN in the Sample, Operator, Attribute, and Parameter dimensions. We present FlexFlow, a deep learning engine
that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel
machine. To accelerate this search, FlexFlow introduces a novel execution simulator that can accurately predict a
parallelization strategy’s performance and is three orders of magnitude faster than prior approaches that execute
each strategy. We evaluate FlexFlow with six real-world DNN benchmarks on two GPU clusters and show that
FlexFlow increases training throughput by up to 3.3× over state-of-the-art approaches, even when including its
search time, and also improves scalability.
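To make the SOAP space concrete, the sketch below models a parallelization strategy as a per-operator configuration: each operator may be split along the Sample, Attribute, and Parameter dimensions and placed on its own set of devices, while the Operator dimension is captured by letting different operators use different configurations. This is a minimal illustration only; the class and field names (`ParallelConfig`, `sample_degree`, etc.) are hypothetical and do not reflect FlexFlow's actual data structures.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical names for illustration; FlexFlow's internal representation differs.
@dataclass
class ParallelConfig:
    """How one DNN operator is partitioned across devices, in SOAP terms."""
    sample_degree: int     # S: split the training batch (data parallelism)
    attribute_degree: int  # A: split an attribute dimension of a tensor (e.g. image height)
    parameter_degree: int  # P: split the operator's parameters (model parallelism)
    devices: List[int]     # which GPUs run the resulting tasks

# A strategy in the SOAP space assigns a config to every operator; the O
# (Operator) dimension means different operators may be parallelized
# differently and placed on different devices.
Strategy = Dict[str, ParallelConfig]

# Pure data parallelism is a single point in this space: every operator
# splits only the sample dimension across all four devices.
data_parallel: Strategy = {
    op: ParallelConfig(sample_degree=4, attribute_degree=1,
                       parameter_degree=1, devices=[0, 1, 2, 3])
    for op in ["conv1", "conv2", "fc1", "softmax"]
}
```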
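The abstract's "guided randomized search" over this space can be sketched as a Metropolis-style random walk in which every candidate is scored by the execution simulator rather than by actually running it, which is what makes thousands of evaluations affordable. This is an assumption-laden sketch, not FlexFlow's exact procedure: `simulate` and `propose` stand in for the paper's execution simulator and strategy mutations, and `beta` is an illustrative temperature parameter.

```python
import math
import random
from typing import Callable, TypeVar

S = TypeVar("S")  # a parallelization strategy, e.g. the Strategy dict sketched above

def guided_search(initial: S,
                  simulate: Callable[[S], float],   # predicted runtime (ms), not a real run
                  propose: Callable[[S], S],        # random local mutation of a strategy
                  steps: int = 10_000,
                  beta: float = 0.05) -> S:
    """Metropolis-style search: always accept faster strategies, and accept
    slower ones with probability exp(-beta * slowdown), so the walk can
    escape local minima while still being guided toward fast strategies."""
    current, best = initial, initial
    cur_cost = best_cost = simulate(initial)
    for _ in range(steps):
        candidate = propose(current)
        cost = simulate(candidate)  # simulator call: cheap compared to executing the strategy
        if cost < cur_cost or random.random() < math.exp(-beta * (cost - cur_cost)):
            current, cur_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = candidate, cost
    return best
```

Because each `simulate` call replaces an actual training run, a search like this evaluates orders of magnitude more strategies per unit time, which is the efficiency claim the abstract makes for the simulator.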