Self-examination Mechanism: Improving Standard Accuracy in Adversarial Purification

Sora Suegami, Yutaro Oguri, Zaiying Zhao, Yu Kagaya, Toshihiko Yamasaki

Published: 2024, Last Modified: 07 Nov 2025. Venue: IVSP 2024. License: CC BY-SA 4.0

Abstract: Deep learning-based image classification models are vulnerable to adversarial examples. While existing defense methods have improved classification accuracy on adversarial examples, they often reduce accuracy on clean images (i.e., images without perturbations). To address this problem, we propose a new defense mechanism called self-examination. In this mechanism, the input image is first classified as attacked or not attacked. Then, the inference process of the classification model is verified using SHapley Additive exPlanations (SHAP), an explainable artificial intelligence (XAI) method. If the input image is determined to be attacked (i.e., an adversarial example), the mechanism outputs the result revised by a reclassification model. Otherwise, it outputs the classifier's result unchanged. Thus, misclassification of adversarial examples can be prevented without significantly reducing classification accuracy on clean images. We construct two reclassification models: one uses the XAI method and the other uses diffusion-based adversarial purification. We evaluate our method on WideResNet trained on CIFAR-10. Experimental results show a negligible impact on clean-image classification accuracy for both reclassifiers and an improvement in classification accuracy on adversarial examples for the diffusion-based reclassifier.
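
The control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `self_examine`, the detector signature (image plus predicted label), and the three toy components are all hypothetical stand-ins chosen for illustration. In particular, the toy detector does not compute SHAP values and the toy reclassifier does not perform diffusion-based purification; they only show where such components would plug in.

```python
from typing import Callable, Sequence

Image = Sequence[float]  # hypothetical stand-in for an image tensor


def self_examine(
    image: Image,
    classifier: Callable[[Image], int],
    is_attacked: Callable[[Image, int], bool],
    reclassifier: Callable[[Image], int],
) -> int:
    """Return the classifier's label unless the input is flagged as attacked."""
    label = classifier(image)
    if is_attacked(image, label):
        # Flagged as adversarial: use the revised result instead.
        return reclassifier(image)
    # Not flagged: output the classifier's result as it is.
    return label


# Toy components (assumptions for illustration only).
def toy_classifier(image: Image) -> int:
    # Label 1 if the mean pixel value exceeds 0.5, else label 0.
    return int(sum(image) / len(image) > 0.5)


def toy_detector(image: Image, label: int) -> bool:
    # Fake "attack" signal: any pixel outside the valid [0, 1] range.
    return any(p < 0.0 or p > 1.0 for p in image)


def toy_reclassifier(image: Image) -> int:
    # Crude "purification": clip pixels to [0, 1], then classify again.
    clipped = [min(max(p, 0.0), 1.0) for p in image]
    return toy_classifier(clipped)


clean = [0.1, 0.2, 0.3]
perturbed = [0.1, 0.2, 1.9]
clean_label = self_examine(clean, toy_classifier, toy_detector, toy_reclassifier)
perturbed_label = self_examine(perturbed, toy_classifier, toy_detector, toy_reclassifier)
```

Because the reclassifier runs only on flagged inputs, an unflagged clean image always receives exactly the label the base classifier would have produced, which is the property the abstract relies on to preserve clean-image accuracy.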