Model-based Saliency for the Detection of Adversarial Examples

25 Sept 2019 (modified: 05 May 2023) · ICLR 2020 Conference Blind Submission
Keywords: Adversarial Examples, Defense, Model-based Saliency
TL;DR: We show that gradients are unable to capture shifts in saliency due to adversarial perturbations and present an alternative adversarial defense using learnt saliency models that is effective against both black-box and white-box attacks.
Abstract: Adversarial perturbations cause a shift in the salient features of an image, which may result in a misclassification. We demonstrate that gradient-based saliency approaches are unable to capture this shift, and develop a new defense which instead detects adversarial examples using learnt saliency models. We study two approaches: a CNN trained to distinguish between natural and adversarial images using the saliency masks produced by our learnt saliency model, and a CNN trained directly on the salient pixels themselves. On MNIST, CIFAR-10 and ASIRRA, our defenses detect various adversarial attacks, including strong attacks such as C&W and DeepFool, unlike gradient-based saliency methods and detectors that rely on the raw input image; the latter fail to detect adversarial images when the L_2 and L_infinity norms of the perturbations are too small. Lastly, we find that the salient-pixel detector improves on the saliency-map detector, as it is more robust to white-box attacks.
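
For concreteness, below is a minimal PyTorch sketch of the two detector variants the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the detector architecture, the `saliency_model` interface (images in, one-channel masks out), and the 0.5 threshold in `salient_pixels` are all hypothetical.

```python
import torch
import torch.nn as nn

class SaliencyDetector(nn.Module):
    """Binary detector: natural (0) vs adversarial (1).
    Use in_channels=1 for the saliency-mask variant, or set it to the
    image channel count (e.g. 3 for RGB) for the salient-pixel variant."""
    def __init__(self, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, 2),  # logits: [natural, adversarial]
        )

    def forward(self, x):
        return self.classifier(self.features(x))

def salient_pixels(images, masks, threshold=0.5):
    # Salient-pixel variant: zero out non-salient pixels and feed the
    # masked image itself to the detector (threshold is an assumption).
    return images * (masks > threshold).float()

def detect(saliency_model, detector, images):
    # saliency_model is assumed to be a pretrained learnt saliency model
    # mapping images (N, C, H, W) to masks (N, 1, H, W) in [0, 1].
    with torch.no_grad():
        masks = saliency_model(images)
        logits = detector(masks)   # mask-based variant
    return logits.argmax(dim=1)    # 1 = flagged as adversarial
```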