Druto: Upper-Bounding Silent Data Corruption Vulnerability in GPU Applications

Published: 2024, Last Modified: 21 Apr 2026IPDPS 2024EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Due to the increasing scale of high-performance computing (HPC) systems, transient hardware faults have become a major reliability concern. Consequently, Silent Data Corruptions (SDCs) due to these faults have been a common insidious consequence in GPU applications. Developers often measure the application resilience with a set of program test inputs available in the benchmark suite, assuming the resilience would not fluctuate much among different inputs. However, we observe that this assumption often results in an over-optimistic evaluation for GPU applications. As a result, the subsequent SDC protection following the evaluation can hardly meet the expected reliability bar in the production environment, where applications would run with potentially arbitrary input values. To this end, we propose Druto – a compiler-based automated technique that searches for inputs to incrementally approach the upper bound of a GPU application’s SDC probability. We develop Druto based on the property that the resilience profiles of a small group of representative threads in a GPU kernel can approximately rank various inputs in terms of the overall SDC probability. Therefore, Druto strategically steers the search towards new program inputs that efficiently portray the overall SDC probability. Evaluation shows that the SDC probability derived from Druto’s input generation is as much as 74× higher than that from existing techniques. Moreover, existing techniques cannot find our generated inputs even given 5× more search time.
Loading