Less Gradient, More Speed: Rethinking Pipeline Parallelism for Efficient Fine-Tuning with FluidPipe

14 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Pipeline Parallelism, Distributed Training, Learning Algorithms, ML Systems
TL;DR: FluidPipe removes per-iteration gradient synchronization by using an auxiliary task head to compute gradients locally at the first stage.
Abstract: Fine-tuning large pretrained models often uses pipeline parallelism (PP) to split layers across devices. PP is simple to deploy but requires per-iteration cross-stage gradient exchanges, creating pipeline bubbles and making performance highly sensitive to network latency. We introduce FluidPipe, a two-stage pipeline design that replaces these gradient exchanges with local updates guided by an auxiliary head and cross-stage bi-directional distillation. This redesign eliminates per-iteration synchronization while preserving model quality. We develop a cost and communication model that explains when FluidPipe outperforms PP, and validate it on BERT-Large and ViT-Large fine-tuning, where FluidPipe achieves up to $3.3\times$ faster training while matching or improving accuracy.
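To make the core idea in the abstract concrete, below is a minimal single-process sketch of how a first stage can update its parameters from a purely local auxiliary-head loss, so that no gradient ever flows back from the second stage. This is an illustrative assumption of the mechanism, not FluidPipe's actual implementation or API; all module and variable names are hypothetical, and the bi-directional distillation terms are omitted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stage 0: first half of the network plus an auxiliary task head (hypothetical sizes).
stage0 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
aux_head = nn.Linear(64, 10)

# Stage 1: second half of the network plus the main task head.
stage1 = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))

opt0 = torch.optim.AdamW(list(stage0.parameters()) + list(aux_head.parameters()), lr=1e-3)
opt1 = torch.optim.AdamW(stage1.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for step in range(3):
    x = torch.randn(8, 32)              # dummy inputs
    y = torch.randint(0, 10, (8,))      # dummy labels

    # Stage 0 update: gradients come entirely from the local auxiliary loss.
    h = stage0(x)
    loss0 = criterion(aux_head(h), y)
    opt0.zero_grad()
    loss0.backward()
    opt0.step()

    # Stage 1 receives *detached* activations (in a real pipeline this would be
    # an activation send/recv), so no backward pass crosses the stage boundary.
    h_detached = h.detach()
    loss1 = criterion(stage1(h_detached), y)
    opt1.zero_grad()
    loss1.backward()
    opt1.step()

    print(f"step {step}: aux loss {loss0.item():.3f}, main loss {loss1.item():.3f}")
```

In a real two-stage deployment the detached hand-off would be an asynchronous activation transfer between devices, and distillation losses would couple the auxiliary and main heads in both directions; the sketch only illustrates that the first stage's update requires no gradient from the second stage.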
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 5140