Keywords: silent data corruption, bit-flip errors, fault tolerance, distributed training, large language models
TL;DR: We propose a lightweight detector for silent data corruption in distributed LLM training that integrates into the communication layer, achieving high fault-detection rates with low runtime overhead.
Abstract: Reliable detection of silent data corruption (SDC), such as bit-flip errors, is critical in large-scale neural network training, as undetected hardware faults can silently propagate and severely degrade model performance. We introduce a lightweight detection method integrated directly before collective communication steps, enabling localization of faulty devices with minimal runtime overhead. Our approach combines statistical modeling of gradient norms with divergence-based criteria to improve robustness. Experiments on large-scale training workloads, including LLaMA2-7B, show that our detector successfully identifies the vast majority of high-order bit-flip faults in bfloat16 while incurring only a very small computational overhead, offering a strong balance between detection accuracy and efficiency.
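The abstract describes flagging corrupted gradients via statistical modeling of gradient norms before collective communication. The sketch below is a hypothetical illustration of that idea, not the paper's exact method: it keeps a running history of per-step gradient norms and flags a tensor whose norm is a z-score outlier before it would enter an all-reduce. The function name, threshold, and window size are all assumptions for illustration.

```python
import numpy as np

def sdc_gradient_check(grad, history, z_thresh=6.0, min_samples=8):
    """Flag a gradient tensor as suspect if its norm is a statistical
    outlier relative to recent history (hypothetical norm-based SDC
    check run before the collective communication step)."""
    norm = float(np.linalg.norm(grad))
    if len(history) >= min_samples:
        mu = np.mean(history)
        sigma = np.std(history) + 1e-12  # avoid division by zero
        if abs(norm - mu) / sigma > z_thresh:
            # Suspect gradient: do not contaminate the history;
            # the caller can skip the all-reduce and localize the device.
            return True, norm
    history.append(norm)
    return False, norm

# A high-order (exponent) bit flip in bfloat16 typically inflates the
# norm by orders of magnitude, so even a simple z-score test catches it.
history = []
rng = np.random.default_rng(0)
for step in range(20):
    g = rng.normal(0, 1, size=1024).astype(np.float32)
    flagged, _ = sdc_gradient_check(g, history)
    assert not flagged  # healthy gradients pass

# Simulate an exponent bit flip: one element blows up.
g = rng.normal(0, 1, size=1024).astype(np.float32)
g[0] = 1e30
flagged, _ = sdc_gradient_check(g, history)
assert flagged
```

Running the check per device before the all-reduce is what enables localization: only the faulty rank's gradient trips the threshold, so the detector identifies which device to exclude rather than merely observing a corrupted aggregate.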
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 11556