TL;DR: We introduce a variant of the Nesterov accelerated gradient method that addresses gradient staleness in asynchronous pipeline-parallel optimization, and demonstrate its feasibility for large-scale language model training.
Abstract: Pipeline Parallelism (PP) enables large neural network training on small, interconnected devices by splitting the model into multiple stages. To maximize pipeline utilization, asynchronous optimization is appealing, as it offers 100% pipeline utilization by construction. However, it is inherently challenging because the weights and gradients are no longer synchronized, leading to *stale (or delayed) gradients*. To alleviate this, we introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in PP. Specifically, we modify the look-ahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of a fixed gradient delay. Our experiments on large-scale language modelling tasks, using decoder-only architectures with up to **1B parameters**, demonstrate that our approach significantly outperforms existing asynchronous methods, even surpassing the synchronous baseline.
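Note (illustrative only): a standard form of the NAG update is
$$v_{t+1} = \mu v_t - \eta \nabla f(\theta_t + \mu v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1},$$
where $\mu$ is the momentum coefficient and $\eta$ the learning rate. In asynchronous PP with a fixed delay $\tau$, the gradient available at step $t$ is evaluated at stale weights $\theta_{t-\tau}$ (plus their look-ahead). A minimal sketch of a delay-aware look-ahead, assuming staleness is absorbed by extrapolating further along the momentum direction, is
$$v_{t+1} = \mu v_t - \eta \nabla f\big(\theta_{t-\tau} + (\tau+1)\,\mu\, v_{t-\tau}\big), \qquad \theta_{t+1} = \theta_t + v_{t+1},$$
which approximately recenters the gradient evaluation at the current iterate. This sketch and its extrapolation rule are assumptions for intuition; the paper's exact modification is given in the full text.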
Lay Summary: Training very large neural networks often requires splitting the model into parts and running them across several smaller devices. If the connection bandwidth between these devices is low (e.g., over the internet), the devices can sit idle due to communication delays. Asynchronous optimization eliminates this idle time by keeping all devices active at all times. This comes at the cost of stale (or delayed) information being used for training, which often hurts model performance.
We address this by predicting the future state of the model using a look-ahead approach, effectively compensating for the delay. We provide a theoretical guarantee that our approach still converges. Our experiments in training large language models show that our asynchronous method not only improves device utilization but also improves final model performance compared to synchronous training.
This shows the possibility of training large AI models using devices connected via the internet, instead of expensive centralized infrastructures.
Link To Code: https://github.com/PluralisResearch/AsyncPP
Primary Area: Optimization->Large Scale, Parallel and Distributed
Keywords: Asynchronous Optimization, Pipeline Parallelism, Nesterov Method, Convergence Analysis, Decentralized Training, Protocol Learning
Submission Number: 8346