Pfeife: Automatic Pipeline Parallelism for PyTorch

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. License: CC BY 4.0
TL;DR: Implementation of fast and general automatic pipeline parallelism for PyTorch using torch.compile.
Abstract: The memory requirements of machine learning (ML) models have been growing quickly. However, the memory capacity of GPUs has not kept pace. Despite significant research on reducing the memory usage of ML models, the largest models still do not fit in a single device. A popular solution to the memory capacity issue is to use multiple devices in parallel. In this paper, we focus on a particular form of parallelism called pipelining, as it offers a good balance between cost and performance for many ML models. We present Pfeife, the first tool that integrates with PyTorch to provide automatic pipelining of ML models. Pfeife intercepts the execution of models and parallelizes them transparently, requiring no manual work. We show that Pfeife can execute large models that would otherwise not run because they do not fit in a single device. Moreover, Pfeife can pipeline non-sequential models such as Stable Diffusion, which are not supported by existing pipeline parallelism tools. Pfeife outperforms state-of-the-art tools by up to 22%.
Lay Summary: Modern AI models contain billions of parameters, but current hardware devices don't have enough memory to store all of them. To overcome this limitation, AI researchers distribute these large models across multiple devices. Pipelining is one effective solution for this distribution challenge. It divides an AI model into several execution stages and assigns these stages to different devices. The input data then flows through these stages across devices like items on a conveyor belt. However, automatic generation of a pipeline for AI model training has been challenging because determining the optimal model partitioning for maximum training speed is complex. We have developed Pfeife, a tool that automatically slices large models and facilitates training through pipelining. Our research demonstrates that Pfeife can successfully train complex models that previously could not be automatically pipelined, while also operating up to 22% faster than existing pipelining tools.
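The conveyor-belt behavior described above can be illustrated with a small scheduling sketch. This is not Pfeife's actual API; it is a minimal, self-contained simulation of a GPipe-style pipeline schedule, showing why pipelining keeps multiple devices busy: with S stages and M microbatches, the pipelined timeline takes S + M - 1 steps instead of the S * M steps a strictly serial execution would need.

```python
# Conceptual sketch only (hypothetical helper, not part of Pfeife):
# simulate which (stage, microbatch) pairs run concurrently at each
# time step of a simple pipeline schedule.

def pipeline_schedule(num_stages, num_microbatches):
    """Return, per time step, the (stage, microbatch) pairs active in parallel."""
    steps = []
    for t in range(num_stages + num_microbatches - 1):
        # Microbatch m reaches stage s at time step s + m, so at time t
        # stage s is processing microbatch t - s (if that one exists).
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_microbatches]
        steps.append(active)
    return steps

schedule = pipeline_schedule(num_stages=3, num_microbatches=4)
print(len(schedule))   # 6 time steps (3 + 4 - 1), versus 12 if run serially
print(schedule[0])     # [(0, 0)]: only the first device is busy at start
print(schedule[2])     # [(0, 2), (1, 1), (2, 0)]: all 3 devices busy at once
```

The middle of the schedule is where pipelining pays off: every device works on a different microbatch simultaneously, which is exactly the behavior Pfeife's automatic partitioning aims to maximize.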
Link To Code: https://github.com/MerHS/pfeife
Primary Area: General Machine Learning->Hardware and Software
Keywords: Parallel Training, Automatic Parallelism, Pipeline Parallelism, PyTorch
Submission Number: 726