TL;DR: We introduce a variant of the Nesterov accelerated gradient method that addresses gradient staleness in asynchronous pipeline-parallel optimization, and demonstrate its feasibility for large-scale language model training.
Abstract: Pipeline Parallelism (PP) enables large neural network training on small, interconnected devices by splitting the model into multiple stages. To maximize pipeline utilization, asynchronous optimization is appealing, as it offers 100% pipeline utilization by construction. However, it is inherently challenging because the weights and gradients are no longer synchronized, leading to *stale (or delayed) gradients*. To alleviate this, we introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in PP. Specifically, we modify the look-ahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of a fixed gradient delay. Our experiments on large-scale language modelling tasks, using decoder-only architectures with up to **1B parameters**, demonstrate that our approach significantly outperforms existing asynchronous methods, even surpassing the synchronous baseline.
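Note (illustrative only): a standard form of the NAG update is
$$v_{t+1} = \mu v_t - \eta \nabla f(\theta_t + \mu v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1},$$
where $\mu$ is the momentum coefficient and $\eta$ the learning rate. In asynchronous PP with a fixed delay $\tau$, the gradient available at step $t$ is evaluated at stale weights $\theta_{t-\tau}$ (plus their look-ahead). A minimal sketch of a delay-aware look-ahead, assuming staleness is absorbed by extrapolating further along the momentum direction, is
$$v_{t+1} = \mu v_t - \eta \nabla f\big(\theta_{t-\tau} + (\tau+1)\,\mu\, v_{t-\tau}\big), \qquad \theta_{t+1} = \theta_t + v_{t+1},$$
which approximately recenters the gradient evaluation at the current iterate. This sketch and its extrapolation rule are assumptions for intuition; the paper's exact modification is given in the full text.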
Lay Summary: Training very large neural networks often requires splitting the model into parts and running them across several smaller devices. If the connection bandwidth between these devices is low (e.g., over the internet), the devices can sit idle due to communication delays. Asynchronous optimization eliminates this idle time by keeping all devices active at all times. This comes at the cost of stale (or delayed) information being used for training, which often hurts model performance.
We address this by predicting the future state of the model using a look-ahead approach, effectively compensating for the delay. We provide a theoretical guarantee that our approach still converges. Our experiments in training large language models show that our asynchronous method not only improves device utilization but also improves final model performance compared to synchronous training.
This shows the possibility of training large AI models using devices connected via the internet, instead of expensive centralized infrastructures.
Link To Code: https://github.com/PluralisResearch/AsyncPP
Primary Area: Optimization->Large Scale, Parallel and Distributed
Keywords: Asynchronous Optimization, Pipeline Parallelism, Nesterov Method, Convergence Analysis, Decentralized Training, Protocol Learning
Submission Number: 8346