- Keywords: distributed training, model-parallel, model parallelism, pipeline, fault tolerance, communication efficiency, volunteer computing
- Abstract: Many deep learning applications benefit from using large models with billions of parameters. These models can only be trained with specialized distributed training algorithms that require a low-latency, high-bandwidth interconnect. As a result, large models are typically trained in dedicated GPU clusters that can be extremely costly to deploy and operate. In contrast, there are more affordable distributed training setups, such as using cheap "preemptible" instances or pooling together existing resources from multiple regions. However, both of these setups come with unique challenges that make it impractical to train large models using conventional model parallelism. In this work, we carefully analyze these challenges and find configurations where training larger models becomes less communication-intensive. Based on these observations, we propose SWARM Parallelism (Stochastically Wired Adaptively Rebalanced Model Parallelism), a model-parallel training algorithm designed for swarms of poorly connected, heterogeneous, unreliable devices. SWARM creates temporary randomized pipelines between available nodes that are rebalanced in case of failure. To further reduce the network usage of our approach, we develop several compression-aware architecture modifications and evaluate their tradeoffs. Finally, we combine our insights to train a large Transformer language model with 1.1B shared parameters (approximately 13B before sharing) on a swarm of preemptible T4 GPUs with less than 400 Mb/s network throughput.
- One-sentence Summary: We propose SWARM Parallelism, a model-parallel training algorithm designed for swarms of poorly connected, heterogeneous, unreliable devices.