AMPipe: Accelerating MoE Model Training with Intra-Block Pipelining

Yichao Fu; Yuhao QING; Shixiong Zhao; Fanxin Li; Bocheng Xiao; Dong HUANG; Heming Cui

21 Sept 2023 (modified: 11 Feb 2024), submitted to ICLR 2024

Primary Area: infrastructure, software libraries, hardware, etc.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: Deep learning systems, parallel systems, pipeline, mixture of experts

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

Abstract: The Mixture-of-Experts (MoE) architecture is a compelling way to scale up pre-trained models, such as large language models (LLMs), and thereby improve model quality (e.g., lower perplexity). However, as the sequence length $N$ increases, both the execution time of the attention layer ($O(N^2)$) and the all-to-all communication time of the MoE layer ($O(N)$, with a large constant factor) become training bottlenecks. Current training systems optimize either the MoE layer (e.g., Tutel) or the attention layer (e.g., FlashAttention), and their performance gains are limited on long sequences, which are essential for training a capable language model with a long context window. In this paper, we introduce AMPipe, a novel pipeline system and paradigm that accelerates the training of large MoE models using Intra-Block Pipelining, particularly on long sequences. AMPipe addresses the two bottlenecks together by dividing and pipelining both the attention layer and the MoE layer to reduce the time cost of these operations. Experimental results show that AMPipe consistently outperforms current training systems that optimize only the MoE layer or only the attention layer. Notably, AMPipe improves the training throughput of a highly optimized transformer block by 23\% on average across 56 benchmark cases and by up to 41\% in long-sequence training, with no statistical impact on model convergence. Our code is available at https://github.com/iclr24-3434/AMPipe.git
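
The scaling argument in the abstract can be illustrated with a toy cost model. The sketch below is not AMPipe's implementation or schedule: the constants `a` and `b`, the chunk count `k`, and both function names are hypothetical, and the pipelined estimate is the textbook makespan of a two-stage pipeline with uniform chunks. It assumes attention costs `a * N^2`, all-to-all communication costs `b * N`, and that one chunk's communication can fully overlap with the next chunk's attention.

```python
def sequential_time(n: int, a: float, b: float) -> float:
    """Attention followed by all-to-all, with no overlap (toy model)."""
    return a * n**2 + b * n


def pipelined_time(n: int, a: float, b: float, k: int) -> float:
    """Two-stage pipeline over k equal chunks (toy model, assumed uniform).

    Stage 1 is the attention work for one chunk, stage 2 is the
    all-to-all for that chunk. The first chunk pays both stages in
    full; every later chunk adds only the slower of the two stages.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    compute = a * n**2 / k  # per-chunk attention time
    comm = b * n / k        # per-chunk all-to-all time
    return compute + comm + (k - 1) * max(compute, comm)


if __name__ == "__main__":
    # Hypothetical constants chosen so both stages are balanced at n = 4.
    n, a, b = 4, 1.0, 4.0
    for k in (1, 2, 4):
        print(k, pipelined_time(n, a, b, k), sequential_time(n, a, b))
```

With `k = 1` the two estimates coincide. Larger `k` hides more of the cheaper stage behind the more expensive one, which is the intuition behind pipelining the two layers inside a block. The model ignores per-chunk launch overheads and any dependency between chunks.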

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 3434
