Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

ICLR 2026 Conference Submission 14237 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: distributed optimization, stochastic optimization, distributed learning, momentum, iterate averaging, primal averaging
TL;DR: We introduce a new primal averaging scheme called Generalized Primal Averaging that smooths DiLoCo and generalizes Nesterov acceleration and Schedule-Free learning.
Abstract: We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method in its primal averaging formulation, that addresses key limitations of recent averaging-based optimizers such as DiLoCo and Schedule-Free (SF). These two algorithmic approaches improve the performance of base optimizers such as AdamW through distinct averaging strategies. Schedule-Free explicitly averages iterates at every step, while DiLoCo performs implicit averaging by periodically aggregating trajectories, called pseudo-gradients, to update the model parameters. This periodic averaging introduces a two-loop structure, increasing both its memory requirements and the number of hyperparameters to tune. To address these limitations, GPA smooths DiLoCo in the non-distributed setting by averaging iterates at every iteration using two interpolation constants. When applied to language model pre-training, GPA consistently outperforms DiLoCo while removing the two-loop structure, simplifying hyperparameter tuning and reducing memory overhead to just a single additional buffer. Furthermore, we prove that for any base optimizer with regret bounded by $\mathcal{O}(\sqrt{T})$, where $T$ is the number of iterations, GPA can match or exceed the convergence guarantee of the original optimizer, depending on the choice of the interpolation constants.
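To make the "averaging iterates at every iteration using two interpolation constants" idea concrete, the following is a minimal, illustrative sketch of a primal-averaging-style wrapper around plain SGD. The names `beta1`, `beta2`, and `gpa_sgd` are hypothetical and the update rule shown is only an assumed generic form of single-loop primal averaging; the actual GPA update and its coupling to the base optimizer are defined in the paper, not here.

```python
import numpy as np

def gpa_sgd(grad_fn, x0, steps, lr=0.1, beta1=0.9, beta2=0.1):
    """Illustrative single-loop primal-averaging wrapper around SGD.

    beta1, beta2: hypothetical names for the two interpolation constants.
    Keeps one extra buffer (the averaged iterate x) beyond the base iterate z.
    """
    z = x0.copy()  # base-optimizer iterate
    x = x0.copy()  # averaged (primal-averaged) iterate; the single extra buffer
    for _ in range(steps):
        # Evaluate the gradient at an interpolation of the two sequences.
        y = (1.0 - beta1) * z + beta1 * x
        # Base-optimizer step (here: plain SGD) on the fast iterate.
        z = z - lr * grad_fn(y)
        # Every-step averaging toward the fast iterate, replacing
        # DiLoCo-style periodic aggregation of pseudo-gradients.
        x = (1.0 - beta2) * x + beta2 * z
    return x

# Usage on a toy quadratic: minimize 0.5 * ||x||^2.
x_final = gpa_sgd(lambda x: x, np.ones(4), steps=200)
print(x_final)
```

Because the averaging happens at every step, there is no inner/outer loop split and no stored pseudo-gradient history, which is the source of the memory and tuning savings described above.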
Primary Area: optimization
Submission Number: 14237