DUO: No Compromise to Accuracy Degradation

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Gradient Compression, Communication Computation Overlapping, Efficient Training, Distributed Training
TL;DR: Achieving High Accuracy in Distributed Training Even with Aggressive Gradient Compression
Abstract: Distributed training often suffers from high communication overhead due to large-scale gradient synchronization. Although gradient compression, particularly at 4-bit or even lower precision, significantly reduces transfer volume, it typically sacrifices precision and degrades final model accuracy. In this work, we introduce DUO, a distributed training framework designed to mitigate the accuracy degradation caused by gradient compression without adding overhead. DUO achieves this by inserting an additional high-precision gradient synchronization step into a previously computation-only phase, so that its communication is fully hidden by computation. We provide a comprehensive theoretical proof of convergence for DUO and validate its effectiveness through extensive pre-training experiments on GPT models. Our results show that DUO restores accuracy under 4-bit gradient compression, achieving performance comparable to uncompressed training. Remarkably, DUO maintains minimal accuracy degradation even under extreme compression, including 1-bit gradients or complete omission of the low-precision gradient communication step (0-bit transmission).
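The sketch below illustrates the overlapping idea described in the abstract in a PyTorch/torch.distributed setting: a low-bit gradient synchronization stays on the critical path, while an additional high-precision synchronization is launched asynchronously so its communication is hidden by a computation-only phase. This is not the authors' implementation; the names `duo_like_step` and `quantize_dequantize_4bit` are illustrative assumptions, the quantizer is a simple stand-in codec, and how DUO actually folds the hidden high-precision gradients back into training is part of the paper's method and not reproduced here.

```python
import torch
import torch.distributed as dist


def quantize_dequantize_4bit(t: torch.Tensor) -> torch.Tensor:
    """Simulated symmetric 4-bit quantize/dequantize (illustrative codec only)."""
    scale = t.abs().max().clamp_min(1e-12) / 7.0  # map values into the int4 range [-8, 7]
    return (t / scale).round().clamp(-8, 7) * scale


def duo_like_step(model: torch.nn.Module, optimizer: torch.optim.Optimizer, loss: torch.Tensor):
    """One training step: low-bit gradient sync on the critical path, plus an extra
    high-precision sync overlapped with computation (hypothetical sketch)."""
    world_size = dist.get_world_size()
    loss.backward()

    # 1) Low-precision (simulated 4-bit) gradient sync on the critical path.
    raw_grads = []  # keep high-precision copies for the overlapped sync
    for p in model.parameters():
        if p.grad is None:
            continue
        raw_grads.append((p, p.grad.detach().clone()))
        q = quantize_dequantize_4bit(p.grad)
        dist.all_reduce(q, op=dist.ReduceOp.SUM)
        p.grad.copy_(q / world_size)

    # 2) Additional high-precision sync, launched asynchronously so its
    #    communication is hidden by the computation-only phase below.
    pending = []
    for p, hp in raw_grads:
        work = dist.all_reduce(hp, op=dist.ReduceOp.SUM, async_op=True)
        pending.append((p, hp, work))

    # 3) Computation-only phase that overlaps with the in-flight transfer.
    optimizer.step()
    optimizer.zero_grad()

    # 4) Collect the averaged high-precision gradients once the overlap ends.
    for p, hp, work in pending:
        work.wait()
        hp.div_(world_size)
        # How DUO reintegrates these hidden high-precision gradients (e.g., as a
        # correction in a later update) is specified in the paper and omitted here.
```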
Supplementary Material: zip
Primary Area: Infrastructure (e.g., libraries, improved implementation and scalability, distributed solutions)
Submission Number: 8320