Meta-learning Optimizers for Communication-Efficient Learning

Charles-Étienne Joseph; Benjamin Thérien; Abhinav Moudgil; Boris Knyazev; Eugene Belilovsky

Published: 18 Mar 2025, Last Modified: 18 Mar 2025. Accepted by TMLR. CC BY 4.0

Abstract: Communication-efficient variants of SGD, specifically local SGD, have received a great deal of interest in recent years. These approaches compute multiple gradient steps locally on each worker before averaging model parameters, which helps relieve the critical communication bottleneck in distributed deep learning training. Although many variants of these approaches have been proposed, they can sometimes lag behind state-of-the-art adaptive optimizers for deep learning. In this work, we investigate whether recent progress in the emerging area of learned optimizers can close this gap in homogeneous data and homogeneous device settings while remaining communication-efficient. Specifically, we meta-learn how to perform global updates given an update from local SGD iterations. Our results demonstrate that learned optimizers can substantially outperform local SGD and its sophisticated variants while maintaining their communication efficiency. Our learned optimizers can even generalize to unseen and much larger datasets and architectures, including ImageNet and ViTs, and to unseen modalities such as language modeling. We therefore show the potential of learned optimizers for improving communication-efficient distributed learning.
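
To make the local SGD scheme described in the abstract concrete, here is a minimal single-process sketch in pure Python. It is an illustration under stated assumptions, not the authors' implementation: the toy 1-D regression problem, the function names (`local_sgd_round`, `plain_global_update`, `train`), and all hyperparameter values are hypothetical. Each simulated worker takes several SGD steps from the same starting parameters, the workers' parameters are averaged once per round, and a pluggable `global_update` function decides how the averaged update is applied. Plain local SGD simply adds the averaged update; a learned optimizer, as studied in the paper, would presumably replace that function with a meta-learned one.

```python
import random

def grad(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def local_sgd_round(w_global, shards, local_steps, lr, batch_size, rng):
    """One communication round: each worker runs local SGD from w_global."""
    local_weights = []
    for shard in shards:
        w = w_global
        for _ in range(local_steps):
            batch = rng.sample(shard, batch_size)
            w -= lr * grad(w, batch)
        local_weights.append(w)
    # The only communication in the round: average the workers' parameters.
    w_avg = sum(local_weights) / len(local_weights)
    return w_avg - w_global  # averaged update relative to the round's start

def plain_global_update(w_global, delta):
    """Global step of plain local SGD: adopt the averaged parameters."""
    return w_global + delta

def train(global_update, rounds=20, workers=4, local_steps=8, seed=0):
    """Run several rounds; global_update is the pluggable (hypothetical) hook."""
    rng = random.Random(seed)
    # Noise-free synthetic data y = 3x; every shard is drawn from the same
    # distribution, mimicking the homogeneous-data setting.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(400)]
    data = [(x, 3.0 * x) for x in xs]
    shards = [data[i::workers] for i in range(workers)]
    w = 0.0
    for _ in range(rounds):
        delta = local_sgd_round(w, shards, local_steps,
                                lr=0.1, batch_size=10, rng=rng)
        w = global_update(w, delta)
    return w

w_final = train(plain_global_update)
print(f"learned weight: {w_final:.4f} (true weight: 3.0)")
```

With these settings the workers communicate 20 times while taking 160 gradient steps each, which is the trade-off the abstract refers to: fewer synchronizations per unit of local computation.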

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission: We have made the following changes to the manuscript:
- We added details about the new SGD baseline in Figures 2, 3, and 10.
- We updated the caption of Figure 2 and Section 5.1.3 with respect to the convergence of the ImageNet models.
- We corrected a minor discrepancy in our Adam baseline (the batch size was set incorrectly), which affects only the LM1B results in subfigures (g,h).

Code: https://github.com/lefameuxbeding/learned_aggregation

Supplementary Material: zip

Assigned Action Editor: ~Tatiana_Likhomanenko1

Submission Number: 3290
