Towards Reliable AI Applications via Algorithm-Based Fault Tolerance on NVDLA

Published: 01 Jan 2022, Last Modified: 01 Mar 2025, Venue: MSN 2022, License: CC BY-SA 4.0
Abstract: With the development of deep neural networks (DNNs), more complex accelerators have been designed for more sophisticated networks. This complexity makes accelerators vulnerable to transient errors. DNN accelerators are also widely used in safety-critical systems, such as autonomous vehicles. Their susceptibility to transient errors therefore makes research on mitigation techniques more significant, and accelerator errors should ideally be eliminated. Some researchers have proposed modular redundancy, which is highly reliable but considerably increases overhead. Algorithm-based solutions are cheaper, but they have primarily been evaluated with software-based error injection. In this study, we propose a novel approach that implements algorithm-based error detection (ABED) under RTL (hardware-based) error injection. Previous studies generally focused on the impact of soft errors in the memory structures of embedded system-based accelerators. In contrast, the main goal of this research is to study the impact of soft errors in processing elements and how to mitigate them. We implement an algorithm-based error detection scheme that uses checksums to verify convolution operations with low overhead. We first explain how to overcome the challenges of implementing ABED on FPGA-based accelerators, and then how to implement it. We implement and evaluate our solution on an industry-level DNN accelerator, the NVIDIA Deep Learning Accelerator (NVDLA). Our error injection method is constructed to test the most common soft error scenarios in processing units. The results show that algorithm-based fault tolerance can detect all silent data corruptions (SDCs) while maintaining a low runtime overhead (6-23%).
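To illustrate the general idea of verifying a convolution with checksums, the following is a minimal sketch, not the paper's implementation. The abstract does not specify the checksum scheme, so this example assumes a filter checksum: because convolution is linear, the element-wise sum of all output channels should equal the convolution of the input with the sum of all kernels. The function names (`conv2d`, `checksum_kernel`, `abed_check`), the integer data, and the single bit flip used as a stand-in for a soft error are all hypothetical.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation on nested integer lists."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)]
            for i in range(oh)]


def checksum_kernel(kernels):
    """Element-wise sum of all kernels (assumed filter checksum)."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    return [[sum(k[a][b] for k in kernels) for b in range(kw)]
            for a in range(kh)]


def abed_check(image, kernels, outputs):
    """Return True if the summed outputs match the checksum convolution."""
    expected = conv2d(image, checksum_kernel(kernels))
    oh, ow = len(expected), len(expected[0])
    actual = [[sum(o[i][j] for o in outputs) for j in range(ow)]
              for i in range(oh)]
    return actual == expected


# Hypothetical integer input and kernels (integers keep the check exact).
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernels = [[[1, 0], [0, -1]],
           [[2, 1], [1, 2]]]

# Fault-free run: one output channel per kernel.
outputs = [conv2d(image, k) for k in kernels]
clean_ok = abed_check(image, kernels, outputs)

# Emulate a soft error in a processing element by flipping one output bit.
faulty = [[row[:] for row in o] for o in outputs]
faulty[1][0][1] ^= 1 << 3
faulty_ok = abed_check(image, kernels, faulty)

print(clean_ok, faulty_ok)
```

In this sketch the check costs one extra convolution plus a channel-wise sum, regardless of the number of kernels, which is why checksum-based detection tends to be cheaper than duplicating the whole computation. A single corrupted output value always changes the channel sum, so the comparison fails for it; multiple errors that cancel each other out would not be caught by this simple variant.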