Keywords: distributed optimization, low-bandwidth training, language modeling
Abstract: We introduce a memory- and compute-efficient method for low-communication distributed training.
Existing methods reduce communication by performing multiple local updates between infrequent
global synchronizations. We demonstrate that their efficiency can be significantly improved by
restricting backpropagation: instead of updating all the parameters, each node updates only a fixed
subset while keeping the remainder frozen during local steps. This constraint substantially reduces
peak memory usage and training FLOPs, while each node's full forward pass over all parameters eliminates
the need for cross-node activation exchange. Experiments on a 1.3B-parameter language model
trained across 32 nodes show that our method matches the perplexity of prior low-communication
approaches under identical token and bandwidth budgets while reducing training FLOPs and peak
memory.
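The following is a minimal sketch, not the authors' implementation, of the local-update rule the abstract describes: each node freezes all parameters except a fixed, node-specific subset during local steps, runs the full forward pass over the whole model, and only periodically synchronizes parameters across nodes. The names `owned_names`, `local_steps`, and `sync_params` are illustrative assumptions.

```python
import torch
import torch.distributed as dist


def mark_trainable_subset(model, owned_names):
    """Freeze every parameter except this node's fixed subset (assumed partitioning)."""
    for name, p in model.named_parameters():
        p.requires_grad = name in owned_names


def local_phase(model, optimizer, data_iter, loss_fn, local_steps):
    """Local updates: full forward pass, backprop only through the owned subset."""
    for _ in range(local_steps):
        x, y = next(data_iter)
        loss = loss_fn(model(x), y)        # forward pass still uses all parameters
        optimizer.zero_grad(set_to_none=True)
        loss.backward()                     # gradients only for requires_grad=True params
        optimizer.step()                    # frozen parameters remain unchanged


def sync_params(model, world_size):
    """Infrequent global synchronization: average all parameters across nodes."""
    with torch.no_grad():
        for p in model.parameters():
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
            p.data.div_(world_size)
```

Because the backward pass and optimizer state cover only the owned subset, training FLOPs and peak memory drop relative to updating all parameters, while the full forward pass on each node avoids any cross-node activation exchange.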
Submission Number: 64