DETECTING ADVERSARIAL PERTURBATIONS WITH SALIENCY

12 Dec 2017 (modified: 25 Jan 2018) · ICLR 2018 Conference Withdrawn Submission · Readers: Everyone
Abstract: In this paper we propose a novel method for detecting adversarial examples by training a binary classifier on both original data and saliency data. For an image classification model, a saliency map explains how the model makes its decision by identifying the pixels most significant to the prediction. Perturbing an original image essentially perturbs the saliency of the correct output with respect to that image. Our approach performs well at detecting adversarial perturbations. We quantitatively evaluate the generalization ability of the detector: a detector trained on strong adversaries and their saliency maps also performs well on weak adversaries. In addition, we further discuss the relationship between defending against adversarial examples and model interpretation, which helps us understand how convolutional neural networks make wrong decisions.
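The core idea in the abstract can be sketched as follows. This is a minimal illustrative assumption of the pipeline, not the authors' implementation: saliency is taken as the magnitude of the gradient of the top logit with respect to the input (for a linear toy model this gradient is simply a row of the weight matrix), and the detector's input is the original image concatenated with its saliency map. All names (`saliency_map`, `detector_features`, the toy weight matrix `W`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "classifier": logits = W @ x (10 classes, 784-dim input).
# Stands in for a real CNN purely to make the sketch self-contained.
W = rng.normal(size=(10, 784))

def saliency_map(x):
    """Saliency as |d(top logit)/dx|. For a linear model the gradient of
    logit c w.r.t. x is row c of W, so the map is just |W[c]|."""
    c = int(np.argmax(W @ x))
    return np.abs(W[c])

def detector_features(x):
    """Detector input: original image concatenated with its saliency map.
    A binary classifier (adversarial vs. clean) would be trained on these."""
    return np.concatenate([x, saliency_map(x)])

x = rng.random(784)          # a stand-in flattened "image"
feats = detector_features(x)
print(feats.shape)           # twice the input dimension: (1568,)
```

With a real network one would replace the hand-computed gradient with automatic differentiation of the predicted-class score with respect to the input pixels; the detector itself can be any binary classifier trained on (clean, adversarial) pairs of such feature vectors.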
Keywords: Adversarial Examples, Detection, Saliency, Model Interpretation