Keywords: GPU Memory, GPUDirect Storage, Large Language Model, Offloading, Cost-Efficient Training, Solid State Drives
Abstract: We present the design and implementation of a new lifetime-aware tensor offloading
framework for GPU memory expansion using low-cost PCIe-based solid-state
drives (SSDs). Our framework, TERAIO, is developed explicitly for large language
model (LLM) training with multiple GPUs and multiple SSDs. Its design is driven
by our observation that active tensors occupy only a small fraction (1.7% on
average) of allocated GPU memory in each LLM training iteration, while inactive
tensors are usually large and remain unused for long periods, creating ample
opportunities for offloading tensors to, and prefetching them from, slow SSDs
without stalling the GPU training process. TERAIO accurately estimates the
lifetime (the period during which a tensor is active in GPU memory) of each
tensor by profiling the first few training iterations. Based on this lifetime
analysis, TERAIO generates an optimized tensor offloading/prefetching plan and
integrates it into the compiled LLM program via PyTorch. A runtime tensor
migration engine executes the plan via GPUDirect Storage, which enables direct
tensor migration between GPUs and SSDs, alleviating the CPU bottleneck and
maximizing SSD bandwidth utilization. Compared with state-of-the-art approaches
such as ZeRO-Offload and ZeRO-Infinity, TERAIO improves the training performance
of various LLMs by 1.47× on average and achieves 80.7% of the ideal performance
assuming unlimited GPU memory.
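The following is a minimal sketch of the abstract's two-phase idea: select offload candidates from profiled tensor lifetimes, then migrate them on a side stream so transfers overlap with compute. All names (TensorLifetime, plan_offloads, offload, prefetch) and thresholds are illustrative assumptions, not TERAIO's actual API, and pinned host memory stands in for the paper's GPU-to-SSD path via GPUDirect Storage so the pattern runs on any CUDA machine.

```python
# Hypothetical sketch of lifetime-aware offload planning and overlapped
# migration; not TERAIO's API. Assumes a CUDA device is present.
from dataclasses import dataclass
from typing import List

import torch


@dataclass
class TensorLifetime:
    name: str
    last_use: int   # operator index of the tensor's last read (e.g., in forward)
    next_use: int   # operator index of its next read (e.g., in backward)
    nbytes: int


def plan_offloads(lifetimes: List[TensorLifetime],
                  min_idle_ops: int = 32,
                  min_bytes: int = 1 << 20) -> List[TensorLifetime]:
    """Select tensors whose idle window and size are large enough that
    offloading and later prefetching can hide under GPU compute."""
    return [t for t in lifetimes
            if t.next_use - t.last_use >= min_idle_ops
            and t.nbytes >= min_bytes]


# Side stream so migrations overlap with compute on the default stream.
_migration_stream = torch.cuda.Stream()


def offload(t: torch.Tensor) -> torch.Tensor:
    """Copy a GPU tensor out asynchronously; a real runtime would record a
    CUDA event and free the GPU copy only after the transfer completes."""
    buf = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
    with torch.cuda.stream(_migration_stream):
        buf.copy_(t, non_blocking=True)
    return buf


def prefetch(buf: torch.Tensor) -> torch.Tensor:
    """Bring an offloaded tensor back to the GPU ahead of its next use."""
    with torch.cuda.stream(_migration_stream):
        return buf.to("cuda", non_blocking=True)


if __name__ == "__main__":
    profiled = [
        TensorLifetime("act.layer3", last_use=120, next_use=900, nbytes=512 << 20),
        TensorLifetime("act.layer40", last_use=800, next_use=820, nbytes=512 << 20),
    ]
    # Only the tensor with a long idle window qualifies: ['act.layer3']
    print([t.name for t in plan_offloads(profiled)])
```

The split between compile-time planning and runtime execution mirrors the abstract's description: the plan is fixed after profiling the first few iterations, so the runtime engine only issues the pre-scheduled transfers.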
Supplementary Material: zip
Primary Area: Infrastructure (e.g., libraries, improved implementation and scalability, distributed solutions)
Submission Number: 23499