Keywords: Distributed Training, Large Language Models, Ensemble Methods
TL;DR: We combine federated learning with sparse parameter sharing to improve performance whilst reducing wall-clock time.
Abstract: Large language model (LLM) training is typically distributed across many accelerators to reduce training time, necessitating frequent exchange of information across high-speed, low-latency networks. Federated learning algorithms like DiLoCo have relaxed this requirement by grouping accelerators into islands, between which communication is infrequent. In the case of DiLoCo, synchronization between workers happens every $H$ steps, thus reducing the communication cost by a factor of $H$. However, if $H$ is too large, model convergence suffers as nodes performing local optimization diverge too far. In this work, we explore Sparse Parameter Averaging (referred to as SPARTA), where models asynchronously share a small subset of their parameters (e.g., 0.05\%) at each training iteration. This keeps the models within the same loss basin, reducing divergence between them. The main contribution of this paper is to combine SPARTA with DiLoCo, which provides two benefits over `pure' DiLoCo. First, SPARTA increases correlation between nodes, enabling a 100× increase in the DiLoCo interval without incurring additional wall-clock time, whilst still achieving performance gains. Second, we show that SPARTA acts as a regularizer, allowing for a higher learning rate and faster convergence.
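To make the combined scheme concrete, below is a minimal sketch of the idea described in the abstract, not the authors' implementation: each worker takes a local optimization step, a small random fraction of parameters (roughly 0.05\%) is averaged across workers every step (SPARTA-style), and a full synchronization happens every $H$ steps. The model, data, fraction $p$, value of $H$, and the use of plain averaging in place of DiLoCo's outer optimizer are all illustrative assumptions.

```python
# Illustrative sketch only (assumptions noted above): two workers train locally,
# exchange a random ~0.05% of parameters each step (SPARTA-style sparse averaging),
# and fully average every H steps (a simplification of DiLoCo's outer step).
import copy
import torch
import torch.nn as nn

H = 500          # outer synchronization interval (DiLoCo-style), assumed value
p = 0.0005       # fraction of parameters averaged per step (~0.05%), assumed value

base = nn.Linear(32, 32)
workers = [copy.deepcopy(base) for _ in range(2)]
opts = [torch.optim.SGD(w.parameters(), lr=0.1) for w in workers]

for step in range(2000):
    # Local optimization step on each worker (dummy objective for illustration).
    for w, opt in zip(workers, opts):
        x = torch.randn(8, 32)
        loss = w(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        # SPARTA-style step: average a small random subset of parameters across workers.
        for params in zip(*(w.parameters() for w in workers)):
            mask = torch.rand_like(params[0]) < p
            mean = torch.stack([q.data for q in params]).mean(dim=0)
            for q in params:
                q.data[mask] = mean[mask]

        # Infrequent full synchronization every H steps (stand-in for DiLoCo's
        # outer optimizer, which in the original method applies momentum to the
        # pseudo-gradient rather than simple averaging).
        if (step + 1) % H == 0:
            for params in zip(*(w.parameters() for w in workers)):
                mean = torch.stack([q.data for q in params]).mean(dim=0)
                for q in params:
                    q.data.copy_(mean)
```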
Submission Number: 39