Abstract: Checkpointing is one of the fundamental techniques for resuming training after a system failure. It is widely used in various domains, such as high-performance computing (HPC) and machine learning (ML). However, as machine learning models rapidly grow in size and complexity, the cost of checkpointing in ML training has become a bottleneck in both storage consumption and performance (time). For example, the latest GPT-4 model contains massive parameters at the scale of 1.76 trillion; frequently writing checkpoints with more than one trillion floating-point values to storage is highly time- and storage-consuming. This work aims to understand and mitigate this problem. First, we characterize the checkpointing interface of a collection of representative large machine learning/language models with respect to storage consumption and performance overhead. Second, we propose two optimizations: i) a periodic cleaning strategy that periodically removes outdated checkpoints to reduce the storage burden; and ii) a data staging optimization that coordinates checkpoints between local and shared file systems to improve performance. Experimental results with GPT-2 variants show that, overall, the proposed optimizations reduce storage consumption to a constant bound while improving checkpointing performance by 2.1× on average in GPT-2 training.
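A minimal sketch of the two ideas named in the abstract, assuming checkpoints are written as directories named by training step (e.g., `step-1000`) under a single checkpoint root; the function names, the `retain_last` parameter, and the directory layout are illustrative assumptions, not the implementation evaluated in this work.

```python
import os
import shutil
import threading


def clean_outdated_checkpoints(ckpt_root: str, retain_last: int = 2) -> None:
    """Periodic-cleaning sketch: keep only the newest `retain_last`
    checkpoints so storage stays bounded (assumes 'step-<N>' directories)."""
    steps = []
    for name in os.listdir(ckpt_root):
        path = os.path.join(ckpt_root, name)
        if os.path.isdir(path) and name.startswith("step-"):
            try:
                steps.append((int(name.split("-")[1]), path))
            except ValueError:
                continue  # skip directories that do not follow the naming scheme
    steps.sort(reverse=True)  # newest checkpoints first
    for _, path in steps[retain_last:]:
        shutil.rmtree(path)  # outdated checkpoint: reclaim its storage


def stage_checkpoint(local_path: str, shared_path: str) -> threading.Thread:
    """Data-staging sketch: the trainer first writes the checkpoint to fast
    node-local storage; it is then copied to the shared file system in a
    background thread so the training loop is not blocked on the slow copy."""
    t = threading.Thread(target=shutil.copytree, args=(local_path, shared_path))
    t.start()
    return t
```

Calling `clean_outdated_checkpoints` after each checkpoint write keeps storage consumption roughly constant regardless of training length, while `stage_checkpoint` overlaps the transfer to the shared file system with continued training.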