MemFerry: A Fast and Memory Efficient Offload Training Framework with Hybrid GPU Computation

Published: 2025, Last Modified: 27 Jan 2026, INFOCOM 2025, CC BY-SA 4.0
Abstract: With the ever-growing size of deep learning models, GPU memory often becomes insufficient during training. A prominent approach is ZeRO-Offload, which moves the optimizer states to CPU memory and performs parameter updates on the CPU. However, ZeRO-Offload suffers from low GPU utilization, imperfect overlapping of communication and computation, and inflexible offloading. In this paper, we leverage Direct Host Access (DHA), which lets the GPU compute directly on data in CPU memory, to form a novel hybrid of on-GPU and DHA computation. We design and implement MemFerry, which consists of an execution scheduler and a shadow model. The scheduler strategically chooses layers of parameters for DHA computation and simultaneously transmits the remaining parameters to GPU memory to shorten forward propagation time; it then loads the DHA parameters into GPU memory to reduce backward propagation time. The shadow model presents a unified memory abstraction for the parameter partitions stored separately in GPU and CPU memories. To further reduce GPU memory usage, we present GO-MemFerry, along with its dynamic programming algorithm, which offloads gradients to CPU memory via DHA. Our experiments show that, compared to ZeRO-Offload on a single GPU, MemFerry trains up to 1.68x faster and GO-MemFerry can train a 1.52x larger model, and that training speed increases by at least 28.1% when scaling to data parallelism on 8 GPUs.
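To illustrate why computing some layers via DHA while the remaining parameters are copied to the GPU can shorten forward propagation, here is a minimal toy sketch. It is not the paper's scheduler: the cost model, the prefix-only layer choice, and the names `forward_time` and `best_split` are all assumptions made for illustration, and the per-layer timings are made-up numbers. The sketch also assumes that DHA computation and host-to-GPU copies do not slow each other down, which may not hold on a shared interconnect.

```python
def forward_time(k, transfer, gpu, dha):
    """Simulated forward time when the first k layers are computed via DHA.

    Hypothetical cost model: transfer[i] is the host-to-GPU copy time of
    layer i, gpu[i] its compute time with parameters in GPU memory, and
    dha[i] its (slower) compute time directly on CPU memory.
    """
    clock = sum(dha[:k])   # DHA layers run back to back starting at t = 0
    copied = 0.0           # time at which the latest copy has finished
    for i in range(k, len(gpu)):
        copied += transfer[i]                 # copies start at t = 0, run in order
        clock = max(clock, copied) + gpu[i]   # wait for the copy, then compute
    return clock


def best_split(transfer, gpu, dha):
    """Return the prefix length k that minimizes the simulated forward time."""
    return min(range(len(gpu) + 1),
               key=lambda k: forward_time(k, transfer, gpu, dha))


# Made-up timings for four identical layers (arbitrary time units).
transfer = [4, 4, 4, 4]
gpu = [1, 1, 1, 1]
dha = [3, 3, 3, 3]

k = best_split(transfer, gpu, dha)
print(k, forward_time(k, transfer, gpu, dha))   # 2 9.0
print(forward_time(0, transfer, gpu, dha))      # 17.0 (copy everything first)
```

With these numbers, copying every layer before computing it takes 17 time units, whereas running the first two layers via DHA hides most of the copy latency and finishes in 9. A real scheduler would also have to account for GPU memory limits and the backward pass, which this sketch ignores.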