Keywords: Low rank, compression, LLM, Pipeline parallelism.
Abstract: Pipeline parallelism enables training of large language models that exceed single-device memory, yet inter-stage activation communication becomes the dominant bottleneck when training spans low-bandwidth, heterogeneous networks. Recent work has proposed compressing activations with fixed orthogonal projections, but this still incurs significant performance degradation and requires intrusive training changes for constrained optimization. A natural alternative is to learn a low-rank projection for each pipeline stage; however, maintaining the necessary orthogonality of these projectors during training remains a challenge. We present \textbf{MDCP-PP} (Manifold and Dictionary Constrained Projection for Pipelined Parallelism), a method that treats inter-stage compression as a learnable orthogonal projection under explicit Stiefel-manifold (orthogonal-matrix) constraints. Rather than prescribing a fixed global subspace, MDCP-PP lets each pipeline stage discover and continuously adapt its own task-optimal compression subspace via manifold-constrained steepest descent. To recover token-specific signals at stage boundaries, we introduce per-stage factorized anchor embeddings that allow full-rank activation reconstruction with negligible communication overhead.
We further incorporate residual vector quantization with a streaming codebook synchronization protocol that amortizes dictionary communication.
Across LLaMA models from 150M to 1B parameters and 4- to 8-stage pipelines, MDCP-PP achieves 4$\times$--8$\times$ inter-stage compression within $\sim$2\% of the uncompressed validation loss, and extends to 16$\times$ compression with vector quantization at $\sim$3\% degradation.
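The abstract's core mechanism, steepest descent constrained to the Stiefel manifold, can be sketched generically: project the Euclidean gradient onto the manifold's tangent space at the current point, take a step, then retract back onto the manifold so the projector's columns stay orthonormal. The snippet below is a minimal NumPy illustration of this standard recipe (QR retraction, canonical tangent-space projection), not the paper's actual implementation; all names and dimensions are illustrative assumptions.

```python
import numpy as np

def qr_retraction(W):
    # Map a matrix back onto the Stiefel manifold (orthonormal columns)
    # via the QR decomposition; sign-fixing makes the result deterministic.
    Q, R = np.linalg.qr(W)
    return Q * np.sign(np.diag(R))

def stiefel_descent_step(W, G, lr=0.1):
    # Riemannian steepest descent on the Stiefel manifold:
    # 1) project the Euclidean gradient G onto the tangent space at W,
    # 2) take a gradient step, 3) retract back onto the manifold.
    WtG = W.T @ G
    riem_grad = G - W @ ((WtG + WtG.T) / 2)  # tangent-space projection
    return qr_retraction(W - lr * riem_grad)

rng = np.random.default_rng(0)
d, r = 8, 2                                       # hidden dim, compressed rank (toy sizes)
W = qr_retraction(rng.standard_normal((d, r)))    # random orthonormal projector
G = rng.standard_normal((d, r))                   # stand-in for dL/dW
W_new = stiefel_descent_step(W, G)
print(np.allclose(W_new.T @ W_new, np.eye(r)))    # columns remain orthonormal: True
```

The key property, verified by the final check, is that orthogonality is preserved exactly after every update, which is what removes the need for the intrusive constrained-optimization machinery mentioned above.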
Submission Number: 136