Abstract: Deep neural networks (DNNs) with trillions of parameters
have emerged, e.g., Mixture-of-Experts (MoE) models. Training models at this scale requires sophisticated parallelization strategies such as the recently proposed SPMD parallelism, which shards each tensor along different dimensions. A common problem with SPMD is that computation stalls during
communication due to data dependencies, resulting in low
GPU utilization and long training times. We present a general technique that accelerates SPMD-based DNN training by maximizing computation-communication overlap and automatically searching for SPMD sharding strategies. The key idea is to duplicate
the DNN model into two copies that have no data dependency on each other,
and interleave their execution such that computation of one
copy overlaps with communication of the other. We propose
a dynamic programming algorithm to automatically identify
optimized sharding strategies that minimize model training
time by maximally enabling computation-communication
overlap. Experiments show that our designs achieve up to
61% training speed-up compared with existing frameworks.
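
To make the interleaving idea concrete, below is a minimal sketch using PyTorch's asynchronous collectives. The layer structure, the `weight_shard` attribute, and the matmul-only layers are illustrative assumptions, not the paper's actual implementation, and the automatic strategy search is not shown.

```python
# Minimal sketch of the interleaving idea using PyTorch's torch.distributed.
# Assumptions (not from the paper): each layer holds a `weight_shard` tensor
# sharded along its input dimension, layers are pure matmuls, and the two
# model copies process independent micro-batches x_a and x_b.
import torch
import torch.distributed as dist


def interleaved_forward(layers_a, layers_b, x_a, x_b, group=None):
    """Run two dependency-free model copies layer by layer, so that the
    all-gather (communication) of one copy overlaps with the matmul
    (computation) of the other."""
    world_size = dist.get_world_size(group)
    handle_b, shards_b = None, None
    for layer_a, layer_b in zip(layers_a, layers_b):
        # Launch copy A's weight all-gather asynchronously ...
        shards_a = [torch.empty_like(layer_a.weight_shard) for _ in range(world_size)]
        handle_a = dist.all_gather(shards_a, layer_a.weight_shard, group=group, async_op=True)
        # ... and hide it behind copy B's computation from the previous step.
        if handle_b is not None:
            handle_b.wait()
            x_b = x_b @ torch.cat(shards_b, dim=0)
        # Symmetrically, launch copy B's all-gather ...
        shards_b = [torch.empty_like(layer_b.weight_shard) for _ in range(world_size)]
        handle_b = dist.all_gather(shards_b, layer_b.weight_shard, group=group, async_op=True)
        # ... and hide it behind copy A's computation for this layer.
        handle_a.wait()
        x_a = x_a @ torch.cat(shards_a, dim=0)
    # Drain the last pending communication of copy B.
    if handle_b is not None:
        handle_b.wait()
        x_b = x_b @ torch.cat(shards_b, dim=0)
    return x_a, x_b
```

Because the two copies share no data dependency, each wait() blocks only on a collective launched one step earlier that has already been hidden behind the other copy's computation.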