Pretraining a Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm
Keywords: large language model, decentralized training, pretraining
TL;DR: We propose SPES, a decentralized framework for pretraining MoE LLMs. SPES supports sparse training on weakly connected nodes, reducing memory and communication costs and enabling efficient pretraining on resource-constrained devices.
Abstract: Pretraining large language models (LLMs) typically relies on centralized clusters equipped with hundreds or thousands of high-memory GPUs ($\textit{e.g.}$, H100/A100), limiting broad exploration within the community. Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still store and train the entire model on each node, remaining constrained by GPU memory limits. In this work, we propose $\textbf{SP}$arse $\textbf{E}$xpert $\textbf{S}$ynchronization ($\textbf{SPES}$), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a small subset of experts on each node, substantially reducing the per-node memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating the need to transmit the full set of model parameters while enabling efficient knowledge sharing across the distributed network. To accelerate convergence, we introduce an expert-merging warm-up strategy: in the early stages of training, experts exchange knowledge via model merging, helping each expert establish foundational capabilities faster. With SPES, we train a 2B-parameter MoE LLM on 16 standalone 48GB GPUs (NVIDIA L40S) connected over the internet, achieving performance competitive with centrally trained LLMs under similar computational budgets. We further demonstrate the scalability of SPES by training a model of up to 7B parameters on open-source data, matching prior centralized baselines. Our SPES pretraining paradigm can be extended to larger numbers of low-end GPUs and to LLMs of larger scales. Code and models will be released.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 941