PipeTune: Tuning Pipeline Parallelism for Efficient Vision-Language Model Training

ICLR 2026 Conference Submission 16124 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Pipeline Parallelism, Vision-Language Model, Efficiency
TL;DR: PipeTune improves vision-language model training efficiency by adaptively tuning pipeline parallelism.
Abstract: Training vision-language models (VLMs) efficiently is crucial for advancing multimodal understanding, yet remains challenging due to the heterogeneity of training data. Variations in sequence lengths and modality composition significantly degrade the performance of pipeline parallelism (PP), leading to increased idle times and low hardware utilization. We present PipeTune, a unified framework that systematically mitigates these inefficiencies by jointly optimizing micro-batch construction, ordering, size, and vision encoder computation. PipeTune adopts a computation-aware packing algorithm to balance workloads, dynamically adjusts micro-batch sizes based on sampled data, reorders execution to minimize stalls, and exploits idle times for encoder pre-computation. A lightweight simulator guides runtime decisions, enabling performance optimization without altering training semantics. Across diverse model sizes, dataset mixtures, and hardware configurations, PipeTune consistently accelerates training, achieving up to 40.7% reduction in iteration time. Our evaluation demonstrates that each optimization component contributes complementary gains, and the overall overhead remains minimal. By holistically addressing data-induced inefficiencies, PipeTune enables more scalable and efficient training of VLMs.
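To make the idea of "computation-aware packing" concrete, below is a minimal, hypothetical sketch of how heterogeneous samples might be grouped into compute-balanced micro-batches; it is not the authors' actual algorithm. The cost model, the `Sample` fields, and the greedy longest-processing-time heuristic are all illustrative assumptions.

```python
# Hypothetical sketch of computation-aware micro-batch packing (not PipeTune's
# actual implementation): samples with heterogeneous text/vision lengths are
# greedily assigned, heaviest first, to the currently lightest micro-batch so
# that estimated per-micro-batch compute is roughly balanced.

from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    text_tokens: int
    image_tokens: int


def estimated_cost(s: Sample) -> float:
    # Toy cost model (assumption): cost grows superlinearly with total
    # sequence length, reflecting attention's quadratic component.
    n = s.text_tokens + s.image_tokens
    return n + 0.001 * n * n


def pack_microbatches(samples: List[Sample], num_microbatches: int) -> List[List[Sample]]:
    """Greedy longest-processing-time packing: sort samples by estimated cost
    and place each one into the micro-batch with the smallest current load."""
    bins: List[List[Sample]] = [[] for _ in range(num_microbatches)]
    loads = [0.0] * num_microbatches
    for s in sorted(samples, key=estimated_cost, reverse=True):
        i = loads.index(min(loads))
        bins[i].append(s)
        loads[i] += estimated_cost(s)
    return bins


if __name__ == "__main__":
    data = [Sample(512, 256), Sample(2048, 0), Sample(128, 576), Sample(1024, 256)]
    for i, mb in enumerate(pack_microbatches(data, 2)):
        total = sum(estimated_cost(s) for s in mb)
        print(f"micro-batch {i}: {len(mb)} samples, estimated cost {total:.0f}")
```

In this toy setting, balancing estimated cost rather than sample count is what reduces pipeline stalls: a micro-batch of many short text-only samples can cost as much as one long interleaved image-text sample.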
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 16124