Keywords: Gradient Compression, Rate-Distortion Optimization, Distributed Training
Abstract: Computational constraints make distributed architectures essential for training large language models (LLMs), yet inter-node gradient synchronization often becomes a major bottleneck in distributed parallel training. Existing compression techniques mainly aim to reduce the communication volume of gradients after they are computed, rather than generating gradients with inherent sparsity during training. In this paper, we propose gradient-constrained training (GCT), a novel approach that leverages gradient constraints to generate low-rate gradients. By balancing performance against rate, we directly shape the training-time gradient source, achieving high compression efficiency with no accuracy degradation. In extensive experiments, GCT provides at least 70\% average bitrate savings and delivers consistent, stable improvements in coding efficiency across various model tasks and distributed systems, indicating that GCT has profound implications for next-generation distributed model training and stable gradient transmission.
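To make the core idea concrete, the following is a minimal, hypothetical sketch (not the authors' GCT algorithm) of shaping gradients at training time so they are cheap to transmit: each gradient tensor is soft-thresholded before synchronization, with the threshold `lam` acting as an assumed rate/distortion trade-off parameter. All names and values here are illustrative assumptions.

```python
# Hypothetical illustration of training-time low-rate gradient generation.
# Not the paper's GCT method; lam and soft_threshold_ are assumed for this sketch.
import torch
import torch.nn as nn

def soft_threshold_(grad: torch.Tensor, lam: float) -> torch.Tensor:
    # Shrink small entries to exactly zero, keeping only large components.
    return grad.sign() * torch.clamp(grad.abs() - lam, min=0.0)

model = nn.Linear(128, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 128)
y = torch.randint(0, 10, (32,))

opt.zero_grad()
loss_fn(model(x), y).backward()

lam = 1e-3  # assumed trade-off between gradient fidelity and sparsity (rate)
for p in model.parameters():
    if p.grad is not None:
        p.grad.copy_(soft_threshold_(p.grad, lam))
        # In a distributed setting, the now-sparse p.grad would be encoded and
        # synchronized here (e.g. via torch.distributed.all_reduce) at a lower bitrate.

opt.step()
```

The point of the sketch is the ordering: sparsity is imposed on the gradient source before communication, rather than compressing dense gradients after the fact.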
Supplementary Material: zip
Primary Area: optimization
Submission Number: 22580