Neural Network Based Silent Error Detector

Chen Wang, Nikoli Dryden, Franck Cappello, Marc Snir

2018 (modified: 30 Mar 2022), CLUSTER 2018

Abstract: As we move toward exascale platforms, silent data corruptions (SDCs) are likely to occur more frequently. Such errors can lead to incorrect results. Attempts have been made to use generic algorithms to detect such errors. These detectors have demonstrated high precision and recall, but only if they run immediately after an error has been injected. In this paper, we propose a neural network detector that can detect SDCs even multiple iterations after they were injected. We have evaluated our detector with 6 FLASH applications and 2 Mantevo mini-apps. Experiments show that our detector can detect more than 89% of SDCs with a false positive rate of less than 2%.
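
The abstract reports a detection rate (recall), precision, and a false positive rate without spelling out how they are computed. The sketch below assumes the standard confusion-matrix definitions, which may differ in detail from the paper's evaluation protocol. The function name `detection_metrics` and the inputs `injected` and `flagged` are hypothetical and do not come from the paper.

```python
def detection_metrics(injected, flagged):
    """Compute recall, false positive rate, and precision for a detector.

    injected[i] is True if run i actually contained an SDC (assumed ground truth).
    flagged[i] is True if the detector raised an alarm for run i.
    Both names are illustrative; they are not taken from the paper.
    """
    pairs = list(zip(injected, flagged))
    tp = sum(1 for actual, alarm in pairs if actual and alarm)          # SDC caught
    fn = sum(1 for actual, alarm in pairs if actual and not alarm)      # SDC missed
    fp = sum(1 for actual, alarm in pairs if not actual and alarm)      # false alarm
    tn = sum(1 for actual, alarm in pairs if not actual and not alarm)  # clean, no alarm

    # Guard against empty classes so the sketch never divides by zero.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "recall": recall,
        "false_positive_rate": false_positive_rate,
        "precision": precision,
    }
```

Under these assumed definitions, the reported results correspond to a recall above 0.89 and a false positive rate below 0.02.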
