Learned Subspace Compression for Communication-Efficient Pipeline Parallelism

08 May 2026 (modified: 09 May 2026) · ICML 2026 Workshop CoLoRAI Submission · CC BY 4.0
Keywords: Low rank, compression, LLM, Pipeline parallelism
Abstract: Pipeline parallelism enables training of large language models that exceed single-device memory, yet inter-stage activation communication becomes the dominant bottleneck when training spans low-bandwidth, heterogeneous networks. Recent work in this area has proposed fixed orthogonal projections to compress activations; however, this still results in significant performance degradation and requires intrusive training changes for constrained optimization. A natural alternative is to learn a low-rank projection for each pipeline stage; however, maintaining the necessary orthogonality of these projectors during training remains a challenge. We present \textbf{MDCP-PP} (Manifold and Dictionary Constrained Projection for Pipelined Parallelism), a method that treats inter-stage compression as a learnable orthogonal projection under explicit Stiefel manifold (matrices with orthonormal columns) constraints. Rather than prescribing a fixed global subspace, MDCP-PP lets each pipeline stage discover and continuously adapt its own task-optimal compression subspace via manifold-constrained steepest descent. To recover token-specific signals at stage boundaries, we introduce per-stage factorized anchor embeddings that allow full-rank activation reconstruction with negligible communication overhead. We further incorporate residual vector quantization with a streaming codebook synchronization protocol that amortizes dictionary communication. Across LLaMA models from 150M to 1B parameters and 4- to 8-stage pipelines, MDCP-PP achieves 4$\times$--8$\times$ inter-stage compression within $\sim$2\% of the uncompressed validation loss, and extends to 16$\times$ compression with vector quantization at $\sim$3\% degradation.
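
To make the manifold-constrained update concrete, below is a minimal sketch (not the authors' code) of one steepest-descent step on the Stiefel manifold, assuming a PyTorch setting; the function name `stiefel_gradient_step`, its arguments, and the choice of a QR retraction are illustrative assumptions rather than details taken from the paper.

```python
import torch

def stiefel_gradient_step(W: torch.Tensor, euclid_grad: torch.Tensor, lr: float) -> torch.Tensor:
    """One manifold-constrained steepest-descent step on the Stiefel manifold.

    W           : (d, r) projection with orthonormal columns (W^T W = I_r)
    euclid_grad : Euclidean gradient dL/dW, same shape as W
    lr          : step size
    """
    # Project the Euclidean gradient onto the tangent space at W:
    # grad_R = G - W * sym(W^T G)
    WtG = W.T @ euclid_grad
    sym = 0.5 * (WtG + WtG.T)
    riem_grad = euclid_grad - W @ sym

    # Take the descent step, then retract back onto the manifold via QR
    # so the updated projector keeps orthonormal columns.
    Q, R = torch.linalg.qr(W - lr * riem_grad, mode="reduced")
    # Fix the column-sign ambiguity of QR so the retraction varies smoothly.
    Q = Q * torch.sign(torch.diagonal(R)).unsqueeze(0)
    return Q
```

In this hypothetical setup the projector would be updated by such a retraction-based step each iteration, keeping the compression subspace exactly orthonormal without adding a soft orthogonality penalty to the loss.
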
Submission Number: 136