Keywords: Asynchronous Pipeline Parallelism, Convergence Optimization, Multi-directional Pipelines, Parameter Mismatch Reduction, Gradient Accumulation
TL;DR: AMDP accelerates pipeline-parallel training by limiting how many minibatches stage 0 reads before the first backward pass, and by using multi-directional pipelines with adaptive gradient accumulation to minimize parameter mismatch.
Abstract: Pipeline parallelism has become a critical technique for scaling up the training of large models. However, existing asynchronous pipeline approaches often suffer from degraded convergence due to parameter mismatch between forward and backward passes. To address this, we propose Asynchronous Multi-Directional Pipeline parallelism (AMDP). AMDP restricts stage 0 of each pipeline to reading only two minibatches before initiating the first backward pass, thereby reducing the number of parameter updates that occur between the forward and backward passes of each minibatch. To mitigate the pipeline bubbles introduced by this restriction, AMDP instantiates multiple concurrent pipelines and adapts their number according to pipeline depth. Furthermore, AMDP accumulates gradients across minibatches and applies them in a single parameter update, ensuring that only a small number of minibatches (bounded by the pipeline depth) encounter parameter mismatch, and that this mismatch is constrained to within one step. Experiments on GPT- and BERT-style models demonstrate that AMDP significantly accelerates the training of large-scale models while preserving convergence. The source code, based on Megatron-LM, is available at https://anonymous.4open.science/r/Megatron-AMDP-BB23
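The effect of the two mechanisms named in the abstract (a cap on the minibatches stage 0 reads before the first backward pass, and gradient accumulation into a single update) can be illustrated with a toy counter. This is a minimal sketch, not the AMDP schedule: it models only stage 0 of one pipeline, assumes a strict one-backward-then-one-forward steady state, and ignores the multi-directional pipelines entirely. The function name `mismatch_per_minibatch` and its parameters `in_flight` and `accum_steps` are hypothetical and do not come from the Megatron-AMDP code.

```python
def mismatch_per_minibatch(num_minibatches, in_flight, accum_steps):
    """Toy model of stage 0: for each minibatch, count the parameter updates
    that happen between its forward pass and its backward pass.

    in_flight   -- minibatches read before the first backward pass (assumed knob)
    accum_steps -- backward passes accumulated into one update (assumed knob)
    """
    version = 0          # number of parameter updates applied so far
    fwd_version = {}     # parameter version seen by each minibatch's forward pass
    pending = 0          # gradients accumulated since the last update
    next_fwd = 0         # index of the next minibatch to run forward
    mismatch = []

    # Warm-up: stage 0 reads `in_flight` minibatches before any backward pass.
    for _ in range(min(in_flight, num_minibatches)):
        fwd_version[next_fwd] = version
        next_fwd += 1

    # Steady state: one backward pass, then (if available) one forward pass.
    for mb in range(num_minibatches):
        mismatch.append(version - fwd_version[mb])
        pending += 1
        if pending == accum_steps:   # apply the accumulated gradients once
            version += 1
            pending = 0
        if next_fwd < num_minibatches:
            fwd_version[next_fwd] = version
            next_fwd += 1
    return mismatch


if __name__ == "__main__":
    print(mismatch_per_minibatch(8, in_flight=4, accum_steps=1))
    print(mismatch_per_minibatch(8, in_flight=2, accum_steps=1))
    print(mismatch_per_minibatch(8, in_flight=2, accum_steps=4))
```

In this toy model, updating after every backward pass with four minibatches in flight leaves steady-state minibatches three updates stale, reducing the in-flight count to two brings that down to one, and adding accumulation leaves only the minibatch whose forward pass straddles an update boundary with a mismatch of one step.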
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 7416