A Simple Unsupervised Data Depth-based Method to Detect Adversarial Images

TMLR Paper1028 Authors

04 Apr 2023 (modified: 17 Sept 2024) · Rejected by TMLR · CC BY 4.0
Abstract: Deep neural networks suffer from critical vulnerabilities regarding robustness, which limits their deployment in many real-world applications. In particular, a serious concern is their inability to defend against adversarial attacks. Although the research community has developed a large number of effective attacks, the detection problem has received comparatively little attention. Existing detection methods rely on either additional training or specific heuristics, at the risk of overfitting. Moreover, they have mainly focused on ResNet architectures, while transformers, which are state-of-the-art for vision tasks, have yet to be properly investigated. In this paper, we overcome these limitations by introducing APPROVED, a simple unsupervised detection method for transformer architectures. It leverages the information available in the logit layer and computes a similarity score with respect to the training distribution. This is accomplished using a data depth that is: (i) computationally efficient; and (ii) non-differentiable, making it harder for gradient-based adversaries to craft malicious samples. Our extensive experiments show that APPROVED consistently outperforms previous detectors on CIFAR10, CIFAR100, and Tiny ImageNet.
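To make the abstract's idea concrete, here is a minimal sketch of a depth-based detection score of the kind described: it compares a test sample's logit vector against the training logit distribution using a rank-based depth approximated with random projections. This is an illustration under stated assumptions, not APPROVED's exact depth function; the function name `depth_score` and all parameters are hypothetical. Note that the rank computation is non-differentiable, matching the property the abstract highlights.

```python
import numpy as np

def depth_score(train_logits, test_logit, n_proj=1000, seed=0):
    """Approximate data depth of `test_logit` w.r.t. `train_logits`
    via random 1-D projections (illustrative sketch, not the paper's
    exact formulation).

    train_logits : (n_train, d) array of logits from clean training data
    test_logit   : (d,) logit vector of the sample to score
    Returns a score in [0, 0.5]; low values indicate outlying
    (potentially adversarial) samples.
    """
    rng = np.random.default_rng(seed)
    d = train_logits.shape[1]
    # Draw random unit directions for the projections.
    u = rng.normal(size=(n_proj, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    proj_train = train_logits @ u.T        # shape (n_train, n_proj)
    proj_test = test_logit @ u.T           # shape (n_proj,)
    # Rank of the test point among training projections, per direction.
    rank = (proj_train <= proj_test).mean(axis=0)
    # Symmetrize: a deep (typical) point sits near the median in most
    # directions, so min(rank, 1 - rank) is close to 0.5 on average.
    return float(np.minimum(rank, 1.0 - rank).mean())
```

In use, a sample would be flagged as adversarial when its score falls below a threshold calibrated on held-out clean data, keeping the detector fully unsupervised with respect to attacks.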
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Fuxin_Li1
Submission Number: 1028