Abstract: Mixture-of-Experts (MoE) has been widely adopted for its ability to scale model size with only a sub-linear increase in computational cost. Training MoE models requires a large number of compute nodes and long training periods, necessitating reliable distributed training systems. Checkpointing is a common approach to improving training reliability by periodically saving model states. Existing checkpointing optimizations focus on hiding checkpoint overhead behind model training computation. However, these approaches overlook the dynamicity inherent in distributed MoE training, leading to inefficient checkpointing. In this paper, we propose Capricorn, a dynamicity-aware in-memory checkpointing approach for efficient MoE model training. We observe that dynamicity affects computation durations at both the layer and iteration levels: at the layer level, different model layers exhibit different computation durations, while at the iteration level, the computation time of the same layer varies across iterations. To adapt to layer-level dynamicity, Capricorn performs online profiling at the granularity of individual layers. Based on the profiling results, it strategically partitions checkpoints into chunks and schedules checkpointing communication to overlap with model computation. To handle dynamicity across iterations, Capricorn speculatively activates the profiling and partitioning processes by exploiting the temporal locality of the experts' load, producing an optimal activation strategy that keeps runtime overhead low while maintaining high checkpoint-partition accuracy. For mainstream MoE models, Capricorn achieves up to $1.56\times$ and $5.98\times$ end-to-end training speedup over Gemini and TorchSnapshot, respectively, under per-iteration checkpointing.
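To make the partition-and-overlap idea concrete, the sketch below illustrates one plausible interpretation of profiling-guided checkpoint chunking: per-layer profiled compute durations bound how many checkpoint bytes can be copied behind each layer's computation. This is not Capricorn's actual implementation; all names and numbers (`partition_checkpoint`, the profile entries, the copy bandwidth) are hypothetical.

```python
# Illustrative sketch only: NOT Capricorn's implementation. It shows the general
# idea from the abstract -- use per-layer profiled compute durations to decide how
# many checkpoint bytes each layer's compute window can hide.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    layer: str    # layer whose compute window hides this chunk's copy
    start: int    # byte offset into the flattened checkpoint buffer
    nbytes: int   # bytes copied while this layer computes


def partition_checkpoint(total_bytes: int,
                         profile: List[Tuple[str, float]],   # [(layer, seconds), ...]
                         bandwidth_bytes_per_s: float) -> List[Chunk]:
    """Greedily assign checkpoint bytes to per-layer compute windows.

    Each window can hide roughly (duration * bandwidth) bytes of checkpoint
    copy traffic; any remaining bytes spill into the next window.
    """
    chunks, offset = [], 0
    for layer, seconds in profile:
        if offset >= total_bytes:
            break
        budget = int(seconds * bandwidth_bytes_per_s)   # bytes hideable here
        nbytes = min(budget, total_bytes - offset)
        if nbytes > 0:
            chunks.append(Chunk(layer, offset, nbytes))
            offset += nbytes
    return chunks


if __name__ == "__main__":
    # Hypothetical per-layer compute durations (seconds) from online profiling.
    profile = [("expert_layer_0", 0.012), ("dense_layer_1", 0.004),
               ("expert_layer_2", 0.015), ("dense_layer_3", 0.004)]
    plan = partition_checkpoint(total_bytes=16 * 2**20,              # 16 MiB of state
                                profile=profile,
                                bandwidth_bytes_per_s=512 * 2**20)   # 512 MiB/s copy rate
    for c in plan:
        print(f"{c.layer}: copy {c.nbytes} bytes at offset {c.start}")
```

Because expert layers' durations shift as token routing changes across iterations, a scheme like this would need to be re-run whenever the profile becomes stale, which is the role the abstract assigns to Capricorn's speculative activation.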
External IDs: dblp:conf/cluster/XieLLLWHL25