Breadth-first pipeline parallelism

Joel Lamy-Poirier

Published: 20 Oct 2022, Last Modified: 05 May 2023

HITY Workshop NeurIPS 2022

Readers: Everyone

Keywords: machine learning, distributed computing, deep learning, large language models, pipeline parallelism

TL;DR: We propose a new method that improves the training speed of large language models.

Abstract: We introduce Breadth-First Pipeline Parallelism, a novel training schedule that optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost, and memory usage by combining high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed increases of up to 53% in training speed.
