Abstract: Decentralized training of large language models makes it possible to pool computational resources across geographically distributed participants, but it is often bottlenecked by network communication, particularly in pipeline-parallel settings. Pipeline parallelism partitions model layers across devices so that large models fit, but it requires frequent communication of intermediate activations, which becomes a bottleneck when network bandwidth is limited.
To address these issues, we propose TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework for pipeline parallelism. TAH-Quant integrates fine-grained tile-wise quantization, entropy-guided adaptive bit allocation across tiles, and a Hadamard-based transformation with pivot swapping to effectively suppress outliers. We prove that pipeline-parallel training equipped with TAH-Quant maintains a convergence rate of $\mathcal{O}(1/\sqrt{T})$, matching that of vanilla stochastic gradient descent. Extensive experiments demonstrate that TAH-Quant quantizes activations aggressively, to 3--4 bits, providing up to $4.3\times$ throughput speedup over uncompressed FP32 and up to $1.33\times$ wall-clock speedup over AQ-SGD, while preserving training convergence, avoiding AQ-SGD's activation-cache overhead, and generalizing well across various training scenarios.
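As a rough illustration of two of the ingredients named above (tile-wise quantization and a Hadamard rotation to spread outliers), the following pure-Python sketch may help. It is not the authors' implementation: the function names, the tile size of 8, the fixed bit width, and the uniform min-max quantizer are all assumptions made for illustration, and the entropy-guided bit allocation and pivot swapping are deliberately left out because the abstract does not specify them.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform.

    Assumes len(vec) is a power of two. With the 1/sqrt(n) scaling the
    transform is its own inverse.
    """
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def quantize_tile(tile, bits):
    """Uniform min-max quantization of one tile (an assumed quantizer).

    Returns the dequantized values so the error can be inspected directly.
    """
    lo, hi = min(tile), max(tile)
    if hi == lo:
        return list(tile)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + round((x - lo) / step) * step for x in tile]


def hadamard_tile_quantize(row, tile_size=8, bits=4):
    """Rotate each tile with a Hadamard transform, quantize, rotate back.

    Assumes tile_size is a power of two that divides len(row). A real
    system would transmit the integer codes plus per-tile (lo, step);
    here the values are reconstructed immediately for illustration.
    """
    out = []
    for s in range(0, len(row), tile_size):
        rotated = fwht(row[s:s + tile_size])
        dequantized = quantize_tile(rotated, bits)
        out.extend(fwht(dequantized))  # self-inverse transform
    return out


def l2_error(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# One activation tile with a single large outlier (made-up values).
activations = [0.1, -0.2, 0.05, 8.0, 0.0, 0.3, -0.1, 0.2]
restored = hadamard_tile_quantize(activations, tile_size=8, bits=4)
print(l2_error(activations, restored))
```

Because the rotation is orthonormal, the reconstruction error in the original domain equals the quantization error in the rotated domain, where the outlier's energy is spread over all entries of the tile rather than dominating the quantization range of a single coordinate.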
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Reza_Babanezhad_Harikandeh1
Submission Number: 8854