AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism

ICLR 2026 Conference Submission 15632 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Asynchronous Optimization, Sparse Averaging, Data and Pipeline Parallelism, Decentralized Training
TL;DR: We introduce a fully asynchronous optimization method that addresses the communication overhead of both data and pipeline parallelism, and demonstrate its efficacy on large-scale language modelling tasks.
Abstract: Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their scalability. We address this communication bottleneck by introducing *asynchronous updates across both parallelism axes*, relaxing the co-location requirement at the expense of introducing *staleness* between pipeline stages and data parallel replicas. To mitigate staleness, for pipeline parallelism, we adopt a weight look-ahead approach, and for data parallelism, we introduce an *asynchronous sparse averaging* method equipped with an exponential-moving-average-based correction mechanism. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models (up to *1B parameters*) demonstrate that our approach matches the performance of the fully synchronous baseline, while significantly reducing communication overhead.
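To make the data-parallel side of the abstract concrete, below is a minimal, single-process toy sketch of what "sparse averaging with an EMA-based correction" could look like: replicas average only a random subset of coordinates per round and are nudged toward a per-replica exponential moving average. The exact update rule, hyperparameters (`ema_decay`, `sparsity`, the correction step size), and masking scheme are assumptions for illustration, not the paper's algorithm.

```python
# Toy, single-process simulation: sparse averaging across data-parallel replicas
# plus an EMA-based correction. Purely illustrative; not the paper's method.
import numpy as np

rng = np.random.default_rng(0)
num_replicas, dim = 4, 16
params = [rng.normal(size=dim) for _ in range(num_replicas)]  # per-replica weights
ema = [p.copy() for p in params]                              # per-replica EMA of weights
ema_decay, sparsity, correction_lr = 0.9, 0.25, 0.1           # hypothetical hyperparameters

def sparse_average_step(params, ema):
    """Average a random subset of coordinates across replicas, then pull each
    replica toward its EMA to damp the noise introduced by partial averaging."""
    mask = rng.random(dim) < sparsity                          # coordinates averaged this round
    mean = np.mean([p[mask] for p in params], axis=0)
    for r in range(num_replicas):
        params[r][mask] = mean                                 # sparse averaging
        ema[r] = ema_decay * ema[r] + (1 - ema_decay) * params[r]
        params[r] += correction_lr * (ema[r] - params[r])      # EMA-based correction (toy)

for _ in range(50):
    sparse_average_step(params, ema)

# Replicas should drift toward consensus despite only partial averaging each round.
print("max pairwise disagreement:",
      max(np.abs(params[i] - params[j]).max()
          for i in range(num_replicas) for j in range(num_replicas)))
```

In an actual asynchronous setting each replica would also continue local optimizer steps between averaging rounds and receive stale partner parameters; the EMA correction is the abstract's stated mechanism for tolerating that staleness.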
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15632