Exploiting Excessive Invariance caused by Norm-Bounded Adversarial Robustness

Jörn-Henrik Jacobsen; Jens Behrmann; Nicholas Carlini; Florian Tramèr; Nicolas Papernot

Exploiting Excessive Invariance caused by Norm-Bounded Adversarial Robustness

Jörn-Henrik Jacobsen, Jens Behrmann, Nicholas Carlini, Florian Tramèr, Nicolas Papernot

25 Sept 2019 (modified: 05 May 2023)ICLR 2020 Conference Blind SubmissionReaders: Everyone

Keywords: Invariance, Robustness, Adversarial Examples

TL;DR: We show that invariance-based adversarial examples are a threat to perturbation robust classifiers both theoretically and practically, e.g., by reducing the accuracy of a defense certified to give 87% accuracy to just 12%.

Abstract: Adversarial examples are malicious inputs crafted to cause a model to misclassify them. In their most common instantiation, "perturbation-based" adversarial examples introduce changes to the input that leave its true label unchanged, yet result in a different model prediction. Conversely, "invariance-based" adversarial examples insert changes to the input that leave the model's prediction unaffected despite the underlying input's label having changed. So far, the relationship between these two notions of adversarial examples has not been studied, we close this gap. We demonstrate that solely achieving perturbation-based robustness is insufficient for complete adversarial robustness. Worse, we find that classifiers trained to be Lp-norm robust are more vulnerable to invariance-based adversarial examples than their undefended counterparts. We construct theoretical arguments and analytical examples to justify why this is the case. We then illustrate empirically that the consequences of excessive perturbation-robustness can be exploited to craft new attacks. Finally, we show how to attack a provably robust defense --- certified on the MNIST test set to have at least 87% accuracy (with respect to the original test labels) under perturbations of Linfinity-norm below epsilon=0.4 --- and reduce its accuracy (under this threat model with respect to an ensemble of human labelers) to 60% with an automated attack, or just 12% with human-crafted adversarial examples.

Original Pdf: pdf

9 Replies

Loading