It has been widely observed that larger neural networks perform better in many real-world applications. While this scaling trend affirms the need to train giant models across multiple devices, it is challenging to partition a model with millions of parameters so that it runs efficiently and effectively on the devices deployed in a cluster of accelerators, e.g., GPUs and TPUs. Recently, pipeline parallelism has been proposed as a novel approach to training deep neural network (DNN) models in a distributed fashion. Compared with data parallelism, existing pipeline-parallel systems achieve significant speed-ups even with naive partitioning schemes.
This paper presents DRL-PP, a deep reinforcement learning (DRL)-based pipeline parallelism framework that learns to optimize the pipeline schedule for training large DNN models across multiple accelerators. The core of DRL-PP is a DRL agent consisting of a graph encoder, which captures the semantics of each operator in the computational graph, followed by a recurrent model partitioner and a pipeline scheduler that learn to partition the model and place operations on GPU devices automatically. In particular, by generating placements recurrently, DRL-PP can partition DNN models in a more flexible and balanced manner, which improves accelerator utilization and speeds up DNN training. We deployed and extensively evaluated DRL-PP on various benchmarks. Compared with the state of the art, DRL-PP speeds up distributed training of benchmark models by up to 6.8× and 1.3× over data parallelism and PipeDream, respectively.
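To make the described agent structure concrete, the following is a minimal sketch of one way such an agent could be organized, assuming a PyTorch setting; the module names, embedding sizes, the one-step message-passing encoder, the softmax placement policy, and the REINFORCE-style update with a training-time reward are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a DRL-PP-style agent: a graph encoder produces per-operator
# embeddings, and a recurrent partitioner consumes them one operator at a time,
# sampling a device (pipeline stage) for each. Names and dimensions are assumptions.
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """Encodes each operator by mixing its own features with the mean of its
    neighbors' features (a single message-passing step over the computational graph)."""
    def __init__(self, feat_dim, hid_dim):
        super().__init__()
        self.proj = nn.Linear(2 * feat_dim, hid_dim)

    def forward(self, op_feats, adj):
        # op_feats: (num_ops, feat_dim); adj: (num_ops, num_ops) 0/1 adjacency matrix
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh = (adj @ op_feats) / deg              # mean of neighbor features
        return torch.relu(self.proj(torch.cat([op_feats, neigh], dim=-1)))

class RecurrentPartitioner(nn.Module):
    """Walks over operator embeddings in order and, at each step, samples the
    device assignment for that operator from a softmax policy (recurrent placement)."""
    def __init__(self, hid_dim, num_devices):
        super().__init__()
        self.cell = nn.GRUCell(hid_dim, hid_dim)
        self.head = nn.Linear(hid_dim, num_devices)

    def forward(self, op_embeds):
        h = torch.zeros(1, op_embeds.size(-1))
        placements, log_probs = [], []
        for emb in op_embeds:                       # generate placements recurrently
            h = self.cell(emb.unsqueeze(0), h)
            dist = torch.distributions.Categorical(logits=self.head(h))
            device = dist.sample()
            placements.append(device.item())
            log_probs.append(dist.log_prob(device))
        return placements, torch.stack(log_probs).sum()

def policy_update(optimizer, log_prob_sum, reward, baseline):
    """REINFORCE-style update; the reward would, hypothetically, be the negative
    measured per-iteration training time of the resulting pipeline schedule."""
    loss = -(reward - baseline) * log_prob_sum
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this sketch, the sampled per-operator device indices induce the model partition, and the measured throughput of the corresponding pipeline schedule serves as the reward signal that trains the encoder and partitioner end to end.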