It has been widely observed that larger neural networks perform better in many real-world applications. While this scaling trend affirms the need to train giant models across multiple devices, it is challenging to partition a model with millions of parameters so that it runs efficiently and effectively on the devices deployed in a cluster of accelerators, e.g., GPUs and TPUs. Recently, pipeline parallelism has been proposed as a novel approach to training deep neural network (DNN) models in a distributed fashion. Compared with data parallelism, existing pipeline-parallel systems achieve significant speed-ups even with naive partitioning schemes.
This paper presents DRL-PP, a deep reinforcement learning (DRL)-based pipeline parallelism framework that learns to optimize the pipeline schedule for training large DNN models across multiple accelerators. The core of DRL-PP is a DRL agent consisting of a graph encoder, which captures the semantics of each operator in the computational graph, followed by a recurrent model partitioner and a pipeline scheduler that learn to partition the model and place operations on GPU devices automatically. In particular, by generating placements recurrently, DRL-PP can partition DNN models in a more flexible and balanced manner, which improves accelerator utilization and speeds up DNN training. We deployed and extensively evaluated DRL-PP on various benchmarks. Compared with the state of the art, DRL-PP speeds up distributed training of benchmark models by up to 6.8× and 1.3× over data parallelism and PipeDream, respectively.
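To make the described agent structure concrete, the following is a minimal sketch of one way such an agent could be organized, assuming a PyTorch setting; the module names, embedding sizes, the one-step message-passing encoder, the softmax placement policy, and the REINFORCE-style update with a training-time reward are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a DRL-PP-style agent: a graph encoder produces per-operator
# embeddings, and a recurrent partitioner consumes them one operator at a time,
# sampling a device (pipeline stage) for each. Names and dimensions are assumptions.
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """Encodes each operator by mixing its own features with the mean of its
    neighbors' features (a single message-passing step over the computational graph)."""
    def __init__(self, feat_dim, hid_dim):
        super().__init__()
        self.proj = nn.Linear(2 * feat_dim, hid_dim)

    def forward(self, op_feats, adj):
        # op_feats: (num_ops, feat_dim); adj: (num_ops, num_ops) 0/1 adjacency matrix
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh = (adj @ op_feats) / deg              # mean of neighbor features
        return torch.relu(self.proj(torch.cat([op_feats, neigh], dim=-1)))

class RecurrentPartitioner(nn.Module):
    """Walks over operator embeddings in order and, at each step, samples the
    device assignment for that operator from a softmax policy (recurrent placement)."""
    def __init__(self, hid_dim, num_devices):
        super().__init__()
        self.cell = nn.GRUCell(hid_dim, hid_dim)
        self.head = nn.Linear(hid_dim, num_devices)

    def forward(self, op_embeds):
        h = torch.zeros(1, op_embeds.size(-1))
        placements, log_probs = [], []
        for emb in op_embeds:                       # generate placements recurrently
            h = self.cell(emb.unsqueeze(0), h)
            dist = torch.distributions.Categorical(logits=self.head(h))
            device = dist.sample()
            placements.append(device.item())
            log_probs.append(dist.log_prob(device))
        return placements, torch.stack(log_probs).sum()

def policy_update(optimizer, log_prob_sum, reward, baseline):
    """REINFORCE-style update; the reward would, hypothetically, be the negative
    measured per-iteration training time of the resulting pipeline schedule."""
    loss = -(reward - baseline) * log_prob_sum
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this sketch, the sampled per-operator device indices induce the model partition, and the measured throughput of the corresponding pipeline schedule serves as the reward signal that trains the encoder and partitioner end to end.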