Abstract: The increasing demand for large-scale deep neural networks (DNNs) has made parallel training an area of intensive focus. One effective method, microbatch-based pipeline parallelism (notably GPipe), accelerates parallel training across various architectures. However, existing parallel training architectures typically use equal data partitioning (EDP), where every layer's process maintains an identical microbatch size. EDP may hinder training speed because different processes often have different optimal microbatch sizes. To address this, we introduce UMPIPE, a novel framework for unequal microbatches-based pipeline parallelism. UMPIPE enables unequal data partitioning (UEDP) across processes to optimize resource utilization. We develop a recurrence formula to calculate the time cost in UMPIPE that accounts for both computation and communication. To further enhance UMPIPE's efficiency, we propose the Dual-Chromosome Genetic Algorithm for UMPIPE (DGAP), which accounts for the independent time costs of forward and backward propagation. Furthermore, we present TiDGAP, a two-level improvement on DGAP that accelerates the search by using matrix operations to compute the end times of multiple individuals and microbatches simultaneously. Our extensive experiments validate the optimization benefits of the dual-chromosome strategy and the acceleration capability of TiDGAP. TiDGAP achieves better training schemes than baselines such as the local greedy algorithm and global greedy-based dynamic programming. Compared to (GPipe, PipeDream), UMPIPE increases training speed by $(13.89, 11.09)\%$ for GPT1-14, $(17.11, 7.96)\%$ for VGG16, and $\geq (170, 100)\%$ for the simulation networks.