Abstract: In this paper we propose a novel method for detecting adversarial examples by training a binary classifier on both original data and saliency data. For an image classification model, a saliency map explains how the model makes decisions by identifying the pixels most significant for the prediction. Perturbing the original image essentially perturbs the saliency of the correct output with respect to the original image. Our approach shows good performance in detecting adversarial perturbations. We quantitatively evaluate the generalization ability of the detector: a detector trained on strong adversaries and their saliency maps performs well on weak adversaries. In addition, we further discuss the relationship between defending against adversarial examples and model interpretation, which helps us understand how convolutional neural networks make wrong decisions.
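The abstract pairs each input with its saliency before passing both to a binary detector. A minimal sketch of that pipeline, using a toy linear scorer where the saliency of the predicted class is simply the (absolute) gradient of its score with respect to the input — all names and the model here are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(10, 64))    # toy 10-class linear model (illustrative)
x = rng.normal(size=64)          # a toy flattened "image"

scores = W @ x
c = int(np.argmax(scores))       # predicted class

# For a linear scorer s(x) = W x, the gradient d s_c / d x is just W[c];
# its elementwise absolute value serves as a simple saliency map.
saliency = np.abs(W[c])
saliency /= saliency.max()       # normalize to [0, 1]

# Detector input = original data concatenated with its saliency data,
# as the abstract describes; a binary classifier would be trained on this.
detector_input = np.concatenate([x, saliency])
print(detector_input.shape)      # (128,)
```

With a real CNN, the saliency would instead come from backpropagating the predicted class score to the input pixels; the detector then distinguishes clean pairs from adversarial ones.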
Keywords: Adversarial Examples, Detection, Saliency, Model Interpretation