Abstract: Deep Neural Networks (DNNs) have made significant strides, but their susceptibility to backdoor attacks remains a concern. Most defenses assume access to white-box models or poisoned data, requirements that are often infeasible in practice, especially for proprietary DNNs. Existing black-box defenses typically rely on the confidence scores of a DNN's predictions; however, exposing these scores puts DNNs at risk of model stealing attacks, a significant concern for proprietary models. In this paper, we introduce a novel strategy for detecting backdoors in a more realistic black-box scenario where only hard-label query access (i.e., without any prediction confidence) is available. Our strategy utilizes the data flow dynamics of the computational environment during DNN inference to identify potential backdoor inputs and is agnostic to trigger types and their locations in the input. We observe that a clean image and its backdoor counterpart containing a trigger induce distinct patterns across various microarchitectural activities during inference. We exploit these variations, captured by Hardware Performance Counters (HPCs), and apply principles of the Gaussian Mixture Model to detect backdoor inputs. To the best of our knowledge, this is the first work to utilize HPCs for detecting backdoors in DNNs. Extensive evaluation across a range of benchmark datasets, DNN architectures, and trigger patterns demonstrates the efficacy of the proposed method in distinguishing between clean and backdoor inputs using HPCs.
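To make the detection idea concrete, the following is a minimal sketch, not the paper's implementation, of how HPC measurements gathered during inference could be modeled with a Gaussian Mixture Model fitted on clean inputs, with low-likelihood HPC profiles flagged as potential backdoor inputs. The counter choices, feature values, and threshold below are illustrative assumptions rather than values taken from the paper.

```python
# Sketch: fit a GMM to HPC feature vectors collected while running inference
# on known-clean inputs, then flag inputs whose HPC profile has low likelihood
# under that model as potential backdoors. All counter names and numbers here
# are hypothetical placeholders.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_clean_profile(clean_hpc_features: np.ndarray, n_components: int = 3) -> GaussianMixture:
    """Fit a GMM to HPC vectors (e.g., cache misses, branch mispredictions,
    instructions retired) measured during inference on clean inputs."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full", random_state=0)
    gmm.fit(clean_hpc_features)
    return gmm

def flag_backdoor(gmm: GaussianMixture, hpc_vector: np.ndarray, threshold: float) -> bool:
    """Flag an input as a potential backdoor if its HPC profile is unlikely
    under the clean-input mixture model."""
    log_likelihood = gmm.score_samples(hpc_vector.reshape(1, -1))[0]
    return log_likelihood < threshold

# Usage with synthetic numbers (real values would come from hardware counters
# sampled during DNN inference, e.g., via perf):
rng = np.random.default_rng(0)
clean = rng.normal(loc=[1e6, 5e4, 2e3], scale=[1e4, 1e3, 50], size=(500, 3))
model = fit_clean_profile(clean)
# Threshold chosen here as the 1st percentile of clean-input scores (an assumption).
threshold = np.percentile(model.score_samples(clean), 1)
suspect = np.array([1.3e6, 9e4, 4e3])  # hypothetical HPC readings for a triggered input
print(flag_backdoor(model, suspect, threshold))
```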