Keywords: silent data corruption, bit-flip errors, fault tolerance, distributed training, large language models
TL;DR: We propose a lightweight detector for silent data corruption in distributed LLM training that integrates into the communication layer, achieving high fault-detection rates with low runtime overhead.
Abstract: Reliable detection of silent data corruption (SDC), such as bit-flip errors, is critical in large-scale neural network training, as undetected hardware faults can silently propagate and severely degrade model performance. We introduce a lightweight detection method integrated directly before collective communication steps, enabling localization of faulty devices with minimal runtime overhead. Our approach combines statistical modeling of gradient norms with divergence-based criteria to improve robustness. Experiments on large-scale training workloads, including LLaMA2-7B, show that our detector successfully identifies the vast majority of high-order bit-flip faults in bfloat16 while incurring only a very small computational overhead, offering a strong balance between detection accuracy and efficiency.
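The abstract describes flagging corrupted gradients via statistical modeling of gradient norms before collective communication. The sketch below is a hypothetical illustration of that idea, not the paper's exact method: it keeps a running history of per-step gradient norms and flags a tensor whose norm is a z-score outlier before it would enter an all-reduce. The function name, threshold, and window size are all assumptions for illustration.

```python
import numpy as np

def sdc_gradient_check(grad, history, z_thresh=6.0, min_samples=8):
    """Flag a gradient tensor as suspect if its norm is a statistical
    outlier relative to recent history (hypothetical norm-based SDC
    check run before the collective communication step)."""
    norm = float(np.linalg.norm(grad))
    if len(history) >= min_samples:
        mu = np.mean(history)
        sigma = np.std(history) + 1e-12  # avoid division by zero
        if abs(norm - mu) / sigma > z_thresh:
            # Suspect gradient: do not contaminate the history;
            # the caller can skip the all-reduce and localize the device.
            return True, norm
    history.append(norm)
    return False, norm

# A high-order (exponent) bit flip in bfloat16 typically inflates the
# norm by orders of magnitude, so even a simple z-score test catches it.
history = []
rng = np.random.default_rng(0)
for step in range(20):
    g = rng.normal(0, 1, size=1024).astype(np.float32)
    flagged, _ = sdc_gradient_check(g, history)
    assert not flagged  # healthy gradients pass

# Simulate an exponent bit flip: one element blows up.
g = rng.normal(0, 1, size=1024).astype(np.float32)
g[0] = 1e30
flagged, _ = sdc_gradient_check(g, history)
assert flagged
```

Running the check per device before the all-reduce is what enables localization: only the faulty rank's gradient trips the threshold, so the detector identifies which device to exclude rather than merely observing a corrupted aggregate.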
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 11556