Abstract: Overlapping gradient communication with backward computation is a popular technique for reducing communication cost in the widely adopted data-parallel S-SGD training. However, resource contention between computation and All-Reduce communication in GPU-based training reduces the benefits of overlap. As GPU cluster networks evolve from low-bandwidth TCP to high-speed networks, more GPU resources are required to utilize the bandwidth efficiently, making the contention more noticeable. Existing communication libraries fail to account for such contention when allocating GPU threads and therefore deliver suboptimal performance. In this paper, we propose to mitigate the contention by balancing the overlapped computation and communication time. We formulate an optimization problem that decides the communication thread allocation to reduce overall backward time. We develop a dynamic-programming-based near-optimal solution and extend it to co-optimize thread allocation with tensor fusion. We conduct a simulation study and real-world experiments on an 8-node GPU cluster with a 50Gb RDMA network, training four representative DNN models. Results show that our method reduces backward time by 10%-20% compared with Horovod-NCCL and by 6%-13% compared with tensor-fusion-optimization-only methods. Simulation shows that our method achieves the best scalability, with a training speedup of 1.2x over the best-performing baseline as we scale up the cluster size.