Beyond Top-K: Structured Sparsification for Compression in Pipeline Parallel

Published: 06 Mar 2025, Last Modified: 04 Apr 2025 · MCDC @ ICLR 2025 · CC BY 4.0
Keywords: Decentralised training, pipeline parallel, compression
TL;DR: A column-sparsification-based compression method for pipeline-parallel training
Abstract: In decentralized training, efficient communication is critical, particularly when training large-scale models over low-bandwidth, heterogeneous networks. Although gradient compression techniques have proven effective in Distributed Data-Parallel (DDP) settings, extending them to pipeline parallel (PP) training is challenging due to cumulative compression errors that compound with network depth. In this work, we introduce a novel compression framework for PP that preserves the column space of activations and gradients instead of compressing individual elements. We derive tight theoretical error bounds and demonstrate the effectiveness of our method by training models over 80 Mbps connections, achieving up to 90\% compression along with around $2 \times$ training and $12 \times$ inference throughput improvements.
Submission Number: 30
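As a rough illustration of the column-preserving idea described in the abstract (in contrast with element-wise top-k), the PyTorch sketch below keeps only the highest-scoring columns of an activation or gradient tensor and transmits them together with their indices. The L2-norm scoring rule and the `keep_ratio` parameter are assumptions made for exposition; they are not details taken from the paper.

```python
import torch


def column_topk_compress(x: torch.Tensor, keep_ratio: float = 0.1):
    """Keep the columns of `x` with the largest L2 norms; drop the rest.

    Illustrative sketch only: the column-scoring rule (L2 norm) and
    `keep_ratio` are assumptions, not details from the paper.
    """
    num_cols = x.shape[-1]
    k = max(1, int(keep_ratio * num_cols))
    # Score each column by its L2 norm over all leading dimensions.
    col_norms = x.reshape(-1, num_cols).norm(dim=0)
    kept = torch.topk(col_norms, k).indices   # indices of retained columns
    compressed = x[..., kept]                 # dense block of kept columns
    return compressed, kept                   # only this is sent over the wire


def column_topk_decompress(compressed: torch.Tensor,
                           kept: torch.Tensor,
                           orig_shape: torch.Size) -> torch.Tensor:
    """Scatter the kept columns back into a zero tensor of the original shape."""
    out = compressed.new_zeros(orig_shape)
    out[..., kept] = compressed
    return out
```

Because whole columns are retained rather than scattered individual elements, the receiving pipeline stage sees a low-dimensional but structurally intact activation block, which is the property the abstract attributes to preserving the column space.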