DeCo-SGD: Joint Optimization of Delay Staleness and Gradient Compression Ratio for Distributed SGD

Published: 2025, Last Modified: 23 Jan 2026, CoRR 2025, CC BY-SA 4.0
Abstract: Regional energy caps limit the growth of any single data center used for large-scale model training. The single-center training paradigm works while model size remains manageable, but exponential growth in model size and computational demand strains it. A natural alternative is to distribute training across multiple data centers over wide-area networks. This pools distributed resources, but suffers from high latency and low, time-varying bandwidth, which sharply reduce throughput. Jointly employing gradient compression and delayed aggregation can alleviate these communication problems, but it introduces a complex three-way trade-off among compression ratio, staleness (the number of delayed synchronization steps), and convergence rate. Existing work lacks theoretical guidance and can only propose fixed strategies that are insensitive to computation and communication conditions. We address this with a new theoretical tool that decomposes the joint optimization problem into a traditional process plus multiple analyzable noise terms. Our analysis yields the first convergence rate for this setting and shows that increasing staleness exponentially amplifies the detrimental effect of compression. Leveraging these insights, we propose DeCo-SGD, which dynamically selects the compression ratio and staleness based on real-time communication and computation conditions. DeCo-SGD achieves up to $5.07\times$ and $1.37\times$ speed-ups over distributed SGD and a static strategy, respectively, in networks with high latency and low, varying bandwidth.
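To make the two knobs in the trade-off concrete, the following is a minimal single-worker sketch, not the DeCo-SGD algorithm itself. It assumes top-k sparsification as the compressor (the abstract does not name one) and models staleness as a fixed queue delay between computing a gradient and applying it. The names `top_k`, `delayed_compressed_sgd`, `ratio`, and `staleness` are hypothetical and chosen only for illustration; error feedback and the dynamic selection of both parameters are omitted.

```python
from collections import deque


def top_k(grad, ratio):
    """Keep roughly ratio * len(grad) largest-magnitude entries; zero the rest.

    Assumption: top-k sparsification stands in for the compressor.
    """
    k = max(1, int(round(ratio * len(grad))))
    keep = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    out = [0.0] * len(grad)
    for i in keep:
        out[i] = grad[i]
    return out


def delayed_compressed_sgd(x0, grad_fn, lr, ratio, staleness, steps):
    """SGD where each compressed gradient is applied `staleness` steps late."""
    x = list(x0)
    in_flight = deque()  # compressed gradients still "in transit"
    for _ in range(steps):
        # Compute and compress the gradient at the current iterate.
        in_flight.append(top_k(grad_fn(x), ratio))
        # Once more than `staleness` gradients are queued, the oldest arrives.
        if len(in_flight) > staleness:
            g = in_flight.popleft()
            x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    # Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
    grad_fn = lambda x: list(x)
    x0 = [1.0, -2.0, 0.5, 3.0]
    for ratio, staleness in [(1.0, 0), (0.5, 0), (0.5, 2)]:
        x = delayed_compressed_sgd(x0, grad_fn, 0.1, ratio, staleness, 50)
        print(ratio, staleness, sum(v * v for v in x))
```

Running the script prints the final squared norm for a few fixed `(ratio, staleness)` pairs on a toy quadratic. A static strategy in the sense of the abstract corresponds to fixing one such pair for the whole run, whereas the proposed method is described as adjusting both according to measured network and compute conditions.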