Abstract: Deep Neural Networks (DNNs) have been widely deployed in a range of safety-critical applications. Recent work has illustrated their vulnerability to malicious backdoor attacks, which lead to DNN malfunction when a specific backdoor trigger is applied to the DNN input image. These backdoors induce uncharacteristic behavior in DNN hidden layers, causing the DNN to misclassify the input image. In this work we present Transform-Enabled Detection of Attacks (TESDA), a novel algorithm for online detection of uncharacteristic behavior in DNN hidden layers indicative of a backdoor. We leverage the training-dataset distributions of reduced-dimension transforms of deep features in a backdoored DNN to rapidly detect malicious behavior, using theoretically grounded methods with bounded false alarm rates. We verify that TESDA achieves state-of-the-art detection with very low latency across a variety of attacks, datasets, and network backbones. Further ablations show that only a small proportion of the DNN training data is needed for TESDA to fit an attack detector to the backdoored network.
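The abstract describes detecting anomalous hidden-layer behavior by fitting the training-data distribution of a reduced-dimension transform of deep features and thresholding a test statistic so the false-alarm rate is bounded. The following is a minimal NumPy sketch of that general idea, not the paper's actual method: the feature matrices, the PCA-style transform, the squared-projection statistic, and the shift magnitude are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for deep features extracted from one hidden layer
# on clean training data: shape (n_samples, feature_dim).
clean_feats = rng.normal(size=(2000, 64))

# Reduced-dimension transform: project features onto the top-k
# principal components of the clean training distribution.
mu = clean_feats.mean(axis=0)
cov = np.cov(clean_feats - mu, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalue order
top_k = eigvecs[:, -8:]                  # top-8 components

def statistic(x):
    """Squared norm of the low-dimensional projection (illustrative score)."""
    z = (x - mu) @ top_k
    return np.sum(z ** 2, axis=-1)

# Calibrate the threshold on the training distribution so the empirical
# false-alarm rate on clean data is bounded by alpha.
alpha = 0.01
threshold = np.quantile(statistic(clean_feats), 1 - alpha)

def is_anomalous(batch_feats):
    return statistic(batch_feats) > threshold

# Fresh clean samples should rarely alarm; features shifted along a
# principal direction (a crude proxy for backdoor activity) should alarm.
clean_rate = is_anomalous(rng.normal(size=(1000, 64))).mean()
shifted = rng.normal(size=(1000, 64)) + 8.0 * top_k[:, -1]
shifted_rate = is_anomalous(shifted).mean()
```

Calibrating the threshold as an empirical quantile of the clean-data statistic is what gives the detector a controllable false-alarm rate; the actual TESDA statistics and transforms differ from this toy score.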
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Chao_Chen1
Submission Number: 4008