Abstract: Pipeline parallelism has emerged as an indispensable technique for training large deep neural networks. While existing asynchronous pipeline systems address the time bubbles inherent in synchronous architectures, they remain inefficient and vulnerable to volatile hardware environments because of their suboptimal, static configurations. In this article, we propose DynPipe, an interference-aware asynchronous pipeline framework that optimizes end-to-end training performance in highly dynamic computing environments. By characterizing the non-overlapped communication overheads and the convergence rate conditioned on stage-wise staleness, DynPipe crafts an optimized pipeline partition that harmonizes hardware speed with statistical convergence. Moreover, DynPipe deploys a non-intrusive random forest model that uses runtime stage statistics to evaluate the impact of environmental changes, such as task interference and network jitter, on training efficiency. Guided by this evaluation, DynPipe adaptively adjusts the partition plan to restore both intra- and inter-stage load balancing, thereby facilitating seamless pipeline reconfiguration in dynamic environments. Extensive experiments show that DynPipe outperforms state-of-the-art systems, accelerating time-to-accuracy by 1.5-3.4×.
External IDs: dblp:journals/tpds/YuanWNTLSLLJ25