PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption, as the number of in-flight microbatches grows with the degree of PP. In this paper, we address this challenge by leveraging the under-explored memory offload strategy in PP. Through empirical study, we find that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitations. Our experiments show that per-device activation memory effectively decreases with the total number of stages, making PP a stronger alternative to tensor parallelism (TP), offering up to a 19% acceleration with even lower memory consumption.
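To make the offload idea concrete, below is a minimal sketch (not the authors' implementation; the class and method names are hypothetical) of asynchronous activation offloading in PyTorch: each stage copies selected saved activations to pinned host memory on a side CUDA stream during the forward pass, then prefetches them back to the GPU before they are needed in the backward pass, so the transfers overlap with compute.

```python
import torch


class ActivationOffloader:
    """Hypothetical helper: offload activations to pinned host memory
    on a side stream and reload them before backward."""

    def __init__(self):
        self.copy_stream = torch.cuda.Stream()  # side stream for D2H/H2D copies
        self.cpu_buffers = {}

    def offload(self, key, act: torch.Tensor):
        # Wait until the activation has been produced on the compute stream,
        # then copy it to pinned CPU memory without blocking compute.
        cpu_buf = torch.empty(act.shape, dtype=act.dtype,
                              device="cpu", pin_memory=True)
        self.copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.copy_stream):
            cpu_buf.copy_(act, non_blocking=True)
        self.cpu_buffers[key] = cpu_buf

    def reload(self, key, device="cuda") -> torch.Tensor:
        # Prefetch the activation back to the GPU ahead of its backward use,
        # then make the compute stream wait for the copy to finish.
        cpu_buf = self.cpu_buffers.pop(key)
        with torch.cuda.stream(self.copy_stream):
            gpu_buf = cpu_buf.to(device, non_blocking=True)
        torch.cuda.current_stream().wait_stream(self.copy_stream)
        return gpu_buf
```

A selective variant, as described in the abstract, would apply `offload` only to a chosen subset of activations (e.g., the largest or longest-lived ones per microbatch) rather than to all of them; the production code in the linked repository handles scheduling and overlap in far more detail.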
Lay Summary: Pipeline parallelism (PP) is a commonly used technique for training large language models (LLMs). However, its scalability is often limited by high activation memory demands. This study addresses that limitation by exploring memory offloading — moving activations temporarily to lower-cost memory — within the context of PP. Through extensive experiments, we show that in most typical settings, at least half, and sometimes all, of the activation data can be offloaded with minimal performance impact. In cases where full offloading is not feasible, we introduce a new selective offloading strategy that reduces peak memory usage more efficiently than previous methods. By integrating offloading with other optimization techniques, we demonstrate improved training speed and reduced memory consumption. Our findings suggest that as the number of pipeline stages increases, the memory required per device decreases significantly. As a result, pipeline parallelism becomes a more attractive and efficient alternative to tensor parallelism, achieving up to a 19% increase in training speed while using less memory.
Link To Code: https://github.com/sail-sg/zero-bubble
Primary Area: Optimization->Large Scale, Parallel and Distributed
Keywords: LLM Training, Pipeline Parallelism, Offloading, Memory Optimization
Submission Number: 15731