Keywords: Gradient Compression, Rate-Distortion Optimization, Distributed Training
Abstract: Computational constraints make distributed architectures essential for training large language models (LLMs), yet inter-node gradient synchronization often becomes a major bottleneck in distributed parallel training. Existing compression techniques mainly aim to reduce the communication volume of gradients after they are computed, rather than generating gradients with inherent sparsity during training. In this paper, we propose gradient-constrained training (GCT), a novel approach that leverages gradient constraints to generate low-rate gradients. By balancing performance against rate, we directly shape the training-time gradient source, achieving high compression efficiency with no accuracy degradation. In extensive experiments, GCT provides at least 70\% average bitrate savings and delivers consistent, stable improvements in coding efficiency across various model tasks and distributed systems, indicating that GCT has profound implications for next-generation distributed model training and stable gradient transmission.
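To make the core idea concrete, the following is a minimal, hypothetical sketch (not the authors' GCT algorithm) of shaping gradients at training time so they are cheap to transmit: each gradient tensor is soft-thresholded before synchronization, with the threshold `lam` acting as an assumed rate/distortion trade-off parameter. All names and values here are illustrative assumptions.

```python
# Hypothetical illustration of training-time low-rate gradient generation.
# Not the paper's GCT method; lam and soft_threshold_ are assumed for this sketch.
import torch
import torch.nn as nn

def soft_threshold_(grad: torch.Tensor, lam: float) -> torch.Tensor:
    # Shrink small entries to exactly zero, keeping only large components.
    return grad.sign() * torch.clamp(grad.abs() - lam, min=0.0)

model = nn.Linear(128, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 128)
y = torch.randint(0, 10, (32,))

opt.zero_grad()
loss_fn(model(x), y).backward()

lam = 1e-3  # assumed trade-off between gradient fidelity and sparsity (rate)
for p in model.parameters():
    if p.grad is not None:
        p.grad.copy_(soft_threshold_(p.grad, lam))
        # In a distributed setting, the now-sparse p.grad would be encoded and
        # synchronized here (e.g. via torch.distributed.all_reduce) at a lower bitrate.

opt.step()
```

The point of the sketch is the ordering: sparsity is imposed on the gradient source before communication, rather than compressing dense gradients after the fact.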
Supplementary Material: zip
Primary Area: optimization
Submission Number: 22580