Keywords: Distributed Training, Fault Tolerance, Large Language Model
TL;DR: This work theoretically and empirically studies, and mitigates, the convergence problems caused by gradient aggregation errors arising from silent hardware or software faults.
Abstract: Identifying and recovering from hardware failures is important in fault-tolerant distributed training to guarantee system efficiency. However, some hardware-related silent data corruption errors during gradient aggregation, like bit corruptions or communication noise, are difficult to capture and address, leading to slow or failed convergence.
To understand and mitigate these errors, we first formulate and generalize them mathematically as gradient inconsistency. We then theoretically analyze how this inconsistency causes model divergence to accumulate during training and ultimately prevents convergence.
Based on the analytical study, we design PAFT, a fault-tolerant distributed training system with dynamic and asynchronous parameter synchronization. PAFT includes two parts: (1) PAFT-Sync, which mitigates model divergence by periodically synchronizing parameters, and (2) PAFT-Dyn, which minimizes synchronization overhead through dynamic training overlap and synchronization frequency scheduling based on profiled error degrees. Together, they ensure efficient model convergence at scale. The fault-tolerant synchronization in PAFT is optimized to support commonly used optimizers, e.g., Stochastic Gradient Descent (SGD), SGD momentum, and Adam.
We implement PAFT on PyTorch Distributed and train ResNet, GPT-2, and LLaMA-2 on 4$\sim$32 GPUs. Experimental results show that PAFT efficiently defends against varying degrees of gradient aggregation error while maintaining training performance.
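The core idea behind PAFT-Sync, that periodically averaging worker parameters bounds the replica divergence caused by inconsistent gradient aggregates, can be illustrated with a toy simulation. This is not the paper's implementation: the quadratic loss, additive-noise corruption model, and all hyperparameters below are illustrative assumptions.

```python
import numpy as np

def train(steps, sync_every=None, noise=0.05, workers=4, lr=0.1, dim=10, seed=0):
    """Simulate data-parallel SGD on f(w) = ||w||^2 / 2.

    Each worker receives the aggregated gradient corrupted by independent
    additive noise (a stand-in for silent aggregation errors), so replicas
    drift apart. With sync_every=k, parameters are averaged every k steps,
    resetting the accumulated divergence. Returns the final maximum
    pairwise distance between worker replicas.
    """
    rng = np.random.default_rng(seed)
    w = np.tile(rng.normal(size=dim), (workers, 1))  # identical init on all workers
    for t in range(1, steps + 1):
        # True aggregate: mean of per-worker gradients (grad of f at w_i is w_i).
        agg = w.mean(axis=0)
        for i in range(workers):
            corrupted = agg + noise * rng.normal(size=dim)  # per-worker corruption
            w[i] -= lr * corrupted
        if sync_every and t % sync_every == 0:
            w[:] = w.mean(axis=0)  # periodic parameter synchronization
    return max(np.linalg.norm(w[i] - w[j])
               for i in range(workers) for j in range(workers))
```

In this model the inter-replica difference performs a random walk driven purely by the noise; without synchronization it grows roughly as the square root of the step count, while periodic averaging caps how many noisy steps can accumulate, matching the divergence-accumulation analysis the abstract describes.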
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 9448