Momentum Look-Ahead for Asynchronous Distributed Low-Communication Training

Published: 06 Mar 2025, Last Modified: 31 Mar 2025 · MCDC @ ICLR 2025 · CC BY 4.0
Keywords: Asynchronous DiLoCo, Nesterov method, Momentum look-ahead
TL;DR: We introduce a momentum-based look-ahead method to alleviate gradient staleness in asynchronous DiLoCo training for language modelling.
Abstract: Distributed Low-Communication (DiLoCo) training allows large-scale model training across geographically distributed datacenters by reducing the communication overhead of the data-parallel setting. Asynchronous DiLoCo further relaxes the requirement to synchronize model updates, eliminating bottlenecks caused by slow devices or interconnects. However, asynchronous updates introduce *stale (or delayed) gradients*, since model updates and gradient computation are no longer synchronized. To alleviate this staleness, we introduce a look-ahead-based delay-correction mechanism that *extrapolates along the negative direction of the momentum*. Our experiments on language-modelling tasks with decoder-only architectures demonstrate that our approach consistently outperforms asynchronous and synchronous DiLoCo in both homogeneous and heterogeneous settings.
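
The abstract describes the correction only at a high level, so the following is a minimal sketch (not the authors' code) of how a momentum look-ahead could be wired into an asynchronous DiLoCo-style outer update: before a worker pulls the global parameters, the server extrapolates them along the negative direction of its outer momentum buffer, so the worker's eventually stale pseudo-gradient is evaluated closer to where the global model will be when the update arrives. The names (`AsyncOuterServer`, `lookahead_coeff`, `outer_lr`) and the exact placement of the extrapolation are assumptions based only on the abstract.

```python
# Hedged sketch of a momentum look-ahead for an asynchronous DiLoCo-style
# outer optimizer. All names and hyperparameters are illustrative assumptions.
import numpy as np

class AsyncOuterServer:
    def __init__(self, params, outer_lr=0.7, momentum=0.9, lookahead_coeff=0.9):
        self.params = params                    # global (outer) parameters
        self.velocity = np.zeros_like(params)   # outer momentum buffer
        self.outer_lr = outer_lr
        self.momentum = momentum
        self.lookahead_coeff = lookahead_coeff

    def get_params_for_worker(self):
        # Look-ahead: hand the worker parameters extrapolated along the
        # negative momentum direction (the direction the global model is
        # already moving), so its delayed pseudo-gradient is computed
        # closer to the point at which it will actually be applied.
        return self.params - self.lookahead_coeff * self.velocity

    def apply_pseudo_gradient(self, pseudo_grad):
        # Asynchronous outer update with momentum; pseudo_grad is the usual
        # DiLoCo outer gradient (worker start params minus worker end params
        # after its inner optimization steps).
        self.velocity = self.momentum * self.velocity + pseudo_grad
        self.params = self.params - self.outer_lr * self.velocity

# Toy usage with one simulated worker round.
server = AsyncOuterServer(params=np.zeros(4))
snapshot = server.get_params_for_worker()       # worker pulls look-ahead params
local = snapshot - 0.1 * np.ones(4)             # stand-in for inner training steps
server.apply_pseudo_gradient(snapshot - local)  # apply the (possibly stale) outer gradient
```

In this reading, the look-ahead plays the same role as the Nesterov "peek ahead": the worker's computation is anchored at an extrapolated point rather than the current iterate, which partially compensates for the delay between pulling parameters and the update being applied.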
Submission Number: 26