Abstract: Modern deep neural networks (DNNs) are deployed across a wide range of applications, from medical robotics to autonomous driving, where safety and reliability are key concerns. The complexity, speed, and low-power operation of the underlying hardware make it vulnerable to soft errors that corrupt the results of computations and memory accesses. Existing approaches to error resilience are either expensive in terms of overhead, require DNN retraining, or apply only to specific hardware domains. In contrast, we present a novel error resilience approach that does not require DNN retraining and scales across both computation errors and weight parameter errors. In the proposed methodology, statistics of the gradients of neuron output values relative to adjacent neurons in an ordering of the neurons enable tight, theoretically grounded thresholding of neuron outputs to diagnose erroneous outputs. Diagnosed outputs are then set to zero (suppressed) for error resilience. A low-overhead error diagnosis module serves this purpose and is designed using gradient statistics collected across the DNN's training dataset. Our approach is compared against state-of-the-art error resilience techniques and validated on multiple datasets, networks, and error scenarios, as well as on a hardware test case.
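To make the diagnose-and-suppress idea concrete, the following is a minimal NumPy sketch, not the paper's implementation: the function names (collect_stats, suppress_errors), the mean + k*std threshold rule, and the both-neighbor flagging heuristic are all illustrative assumptions layered on the abstract's description of adjacent-neuron gradient statistics.

```python
import numpy as np

def collect_stats(train_acts, k=6.0):
    """Derive per-position thresholds from fault-free activations.

    train_acts: (num_samples, num_neurons) layer outputs over the training
    set, with neurons arranged in a fixed ordering. The threshold for each
    adjacent pair is mean + k*std of the absolute difference ("gradient")
    between neighboring neuron outputs; k is an assumed safety margin.
    """
    grads = np.abs(np.diff(train_acts, axis=1))
    return grads.mean(axis=0) + k * grads.std(axis=0)

def suppress_errors(acts, thresholds):
    """Zero out neuron outputs diagnosed as erroneous at inference time.

    Heuristic (an assumption of this sketch): an interior neuron is flagged
    when the gradients to BOTH of its neighbors exceed their thresholds,
    since a single corrupted value violates both adjacent edges.
    """
    grads = np.abs(np.diff(acts))
    viol = grads > thresholds
    flagged = np.zeros_like(acts, dtype=bool)
    flagged[1:-1] = viol[:-1] & viol[1:]  # interior neurons need both edges
    out = acts.copy()
    out[flagged] = 0.0  # suppress diagnosed outputs
    return out

# Toy usage: a large soft-error-like corruption on one neuron is suppressed.
rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 64)).cumsum(axis=1)  # smooth synthetic activations
thr = collect_stats(train)
sample = train[0].copy()
sample[32] += 1e3  # inject a fault
clean = suppress_errors(sample, thr)
assert clean[32] == 0.0
```

The zeroing step mirrors the suppression described in the abstract; the specific threshold statistic and neighbor-voting rule here stand in for the paper's theoretically grounded bounds.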