Abstract: Distributed training of large models is highly time- and resource-intensive. Failures during the long training period are often inevitable and can incur substantial recovery costs. Checkpointing, the standard fault tolerance approach, periodically stores the latest model states in remote persistent storage. This process can be time-consuming due to limited network bandwidth, and it adversely affects training throughput. In-memory checkpointing addresses this issue by saving checkpoint data to host memory instead of remote storage. However, host memory is non-persistent and may not provide sufficient resilience when machines fail. We propose ECCheck, a novel in-memory checkpointing system that employs erasure coding to enhance fault tolerance in distributed deep neural network training. ECCheck advocates serialization-free encoding and decoding in model checkpointing. We propose several techniques to minimize the computation and communication overhead incurred by erasure coding. Extensive experiments demonstrate that ECCheck achieves superior fault tolerance compared to state-of-the-art solutions, while maintaining high checkpointing frequency, low checkpointing stalls, and fast recovery from failures.
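To illustrate the general idea of protecting in-memory checkpoint shards with erasure coding, the following is a minimal sketch using a single XOR parity shard, which tolerates the loss of any one shard. The abstract does not state which code ECCheck uses, so the XOR scheme, the function names (`encode_parity`, `recover`), and the byte-string shards are all assumptions made for illustration, not the authors' method.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(shards: List[bytes]) -> bytes:
    """Compute one parity shard over equal-length data shards.

    Hypothetical helper: each shard stands in for the checkpoint
    bytes held in one machine's host memory.
    """
    if len({len(s) for s in shards}) != 1:
        raise ValueError("all shards must have the same length")
    return reduce(xor_bytes, shards)


def recover(shards: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild at most one missing shard (marked None) from the parity."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity tolerates only one lost shard")
    if not missing:
        return list(shards)
    survivors = [s for s in shards if s is not None]
    # XOR of the parity with all surviving shards yields the lost shard.
    rebuilt = reduce(xor_bytes, survivors, parity)
    restored = list(shards)
    restored[missing[0]] = rebuilt
    return restored


# Usage: three machines hold one shard each; a fourth holds the parity.
data = [b"aaaa", b"bbbb", b"cccc"]
parity = encode_parity(data)
# Simulate losing the machine that held shard 1.
restored = recover([data[0], None, data[2]], parity)
```

A single parity shard is the simplest erasure code; codes that tolerate several simultaneous machine failures need more parity shards and more involved arithmetic, which this sketch does not cover.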
External IDs: dblp:conf/icdcs/QiL0PZ25