HALoS: Hierarchical Asynchronous Local SGD over Slow Networks for Geo-Distributed Large Language Model Training

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose a hierarchical asynchronous optimization method that accelerates geo-distributed LLM training by combining local and global model updates with theoretical guarantees, achieving faster convergence without sacrificing accuracy.
Abstract: Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS attains up to 7.5× faster convergence than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1×. Crucially, HALoS preserves the model quality of fully synchronous SGD—matching or exceeding accuracy on standard language modeling and downstream benchmarks—while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments.
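To make the hierarchical update flow concrete, here is a minimal NumPy sketch of the server-side accumulation and merging described above. It is an illustration under our own assumptions, not the HALoS implementation (the actual code is at the repository linked below): the class and method names, the hyperparameters, and details such as resetting local momentum on sync are hypothetical choices for a toy quadratic problem.

```python
# Illustrative sketch only (not the HALoS implementation; see the linked
# repository). Class names, hyperparameters, and the velocity reset on sync
# are assumptions made for this toy example.
import numpy as np

class LocalParameterServer:
    """Region-local server: accumulates asynchronous worker gradients."""
    def __init__(self, params, lr=0.05, momentum=0.9):
        self.params = params.copy()        # region-local model replica
        self.anchor = params.copy()        # model at the last global sync
        self.velocity = np.zeros_like(params)
        self.lr, self.momentum = lr, momentum

    def apply_worker_gradient(self, grad):
        # Momentum SGD step on the local replica; workers in the same
        # region push gradients here asynchronously over fast links.
        self.velocity = self.momentum * self.velocity + grad
        self.params -= self.lr * self.velocity

    def delta_since_sync(self):
        # Accumulated local progress since the last merge with the GPS.
        return self.params - self.anchor

    def sync(self, merged_params):
        # Adopt the merged global model; resetting local momentum here is a
        # simplification of this sketch, not necessarily what HALoS does.
        self.params = merged_params.copy()
        self.anchor = merged_params.copy()
        self.velocity[:] = 0.0

class GlobalParameterServer:
    """Global server: merges region-level deltas with outer momentum."""
    def __init__(self, params, outer_lr=0.5, outer_momentum=0.6):
        self.params = params.copy()
        self.velocity = np.zeros_like(params)
        self.outer_lr, self.outer_momentum = outer_lr, outer_momentum

    def merge(self, delta):
        # Treat a region's accumulated delta as a pseudo-gradient and
        # smooth it with momentum before updating the global model.
        self.velocity = self.outer_momentum * self.velocity + delta
        self.params = self.params + self.outer_lr * self.velocity
        return self.params

# Toy run: two regions minimize f(x) = 0.5 * ||x||^2 (gradient is x).
dim = 4
gps = GlobalParameterServer(np.ones(dim))
regions = [LocalParameterServer(gps.params) for _ in range(2)]
for _ in range(50):                       # infrequent inter-region rounds
    for lps in regions:                   # regions progress independently
        for _ in range(5):                # cheap intra-region local steps
            grad = lps.params.copy()      # gradient of f at the local model
            lps.apply_worker_gradient(grad)
        lps.sync(gps.merge(lps.delta_since_sync()))
print("distance to optimum:", np.linalg.norm(gps.params))
```

In this toy run, each region takes several cheap local momentum steps, ships only its accumulated delta across the slow inter-region link, and then adopts the merged global model, which mirrors the communication pattern the abstract credits for HALoS's speedups.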
Lay Summary: Training today’s LLMs typically packs thousands of identical GPUs into one data center and keeps them perfectly synchronized. That approach works only if you can gather all that hardware in one place and pay for an ultra-fast network. Many teams instead spread GPUs across multiple cloud regions, where links are slow and the hardware is heterogeneous. Even hyperscalers are moving to multi-region deployments because of operational challenges. Our paper introduces HALoS, a simple hierarchical, asynchronous layer that keeps learning efficient in those settings. Each region runs local servers that collect updates from nearby GPUs and talk only occasionally to one global server. Because most messages stay on the fast regional network, accelerators rarely sit idle, and the few long-distance exchanges are merged efficiently instead of being sent one by one. In experiments that mimic realistic geo-distributed settings, HALoS pretrains LLMs significantly faster than existing synchronous and asynchronous baselines while preserving model performance on downstream tasks such as MMLU and HellaSwag. We expect HALoS to be a practical, powerful solution for efficient LLM pretraining in emerging geo-distributed training environments.
Link To Code: https://github.com/utnslab/halos
Primary Area: Optimization->Large Scale, Parallel and Distributed
Keywords: Hierarchical Geo-Distributed LLM Training, Asynchronous Optimization, Local SGD with Momentum
Submission Number: 3285