Self-Supervised Adversarial Example Detection by Disentangled Representation

Zhaoxi Zhang; Leo Yu Zhang; Xufei Zheng; Shengshan Hu; Jinyu Tian; Jiantao Zhou

Self-Supervised Adversarial Example Detection by Disentangled Representation

Zhaoxi Zhang, Leo Yu Zhang, Xufei Zheng, Shengshan Hu, Jinyu Tian, Jiantao Zhou

21 May 2021 (modified: 05 May 2023)NeurIPS 2021 SubmittedReaders: Everyone

Keywords: Adversarial examples, adversarial detection, deep neural network, disentangled representation

TL;DR: Mimic the behavior of adversarial examples and reduce the unnecessary generalization ability of autoencoder by disentangled representations for adversarial detection.

Abstract: Deep learning models are known to be vulnerable to adversarial examples that are elaborately designed for malicious purposes and are imperceptible to the human perceptual system. Autoencoder, when trained solely over benign examples, has been widely used for (self-supervised) adversarial detection based on the assumption that adversarial examples yield larger reconstruction error. However, because lacking adversarial examples in its training and the too strong generalization ability of autoencoder, this assumption does not always hold true in practice. To alleviate this problem, we explore to detect adversarial examples by disentangled representations of images under the autoencoder structure. By disentangling input images as class features and semantic features, we train an autoencoder, assisted by a discriminator network, over both correctly paired class/semantic features and incorrectly paired class/semantic features to reconstruct benign and counterexamples. This mimics the behavior of adversarial examples and can reduce the unnecessary generalization ability of autoencoder. We compare our method with the state-of-the-art self-supervised detection methods under different adversarial attacks and different victim models ($30$ attack settings), and it exhibits better performance in various measurements (AUC, FPR, TPR) for most attack settings. Ideally, AUC is $1$ and our method achieves $0.99+$ on CIFAR-10 for all attacks. Notably, different from other Autoencoder-based detectors, our method can provide resistance to the adaptive adversary.

Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.

Supplementary Material: zip

12 Replies

Loading