- Abstract: Deep neural networks are vulnerable to adversarial examples: input data that has been manipulated to cause dramatic model output errors. To defend against such attacks, we propose NeuralFingerprinting: a simple, yet effective method to detect adversarial examples that verifies whether model behavior is consistent with a set of fingerprints. These fingerprints are encoded into the model response during training and are inspired by the use of biometric and cryptographic signatures. In contrast to previous defenses, our method does not rely on knowledge of the adversary and can scale to large networks and input data. The benefits of our method are that 1) it is fast, 2) it is prohibitively expensive for an attacker to reverse-engineer which fingerprints were used, and 3) it does not assume knowledge of the adversary. In this work, we 1) theoretically analyze NeuralFingerprinting for linear models and 2) show that NeuralFingerprinting significantly improves on state-of-the-art detection mechanisms for deep neural networks, by detecting the strongest known adversarial attacks with 98-100% AUC-ROC scores on the MNIST, CIFAR-10 and MiniImagenet (20 classes) datasets. In particular, we consider several threat models, including the most conservative one in which the attacker has full knowledge of the defender's strategy. In all settings, the detection accuracy of NeuralFingerprinting generalizes well to unseen test-data and is robust over a wide range of hyperparameters.
- Keywords: Adversarial Attacks, Deep Neural Networks
- TL;DR: Novel technique for detecting adversarial examples -- robust across gradient-based and gradient-free attacks, AUC-ROC >95%