Why Do Vision Transformers Have Better Adversarial Robustness than CNNs?

TMLR Paper668 Authors

06 Dec 2022 (modified: 28 Feb 2023) · Rejected by TMLR
Abstract: Deep-learning models perform remarkably well across many fields, owing to advances in computing power and the large-scale datasets used to train them. However, they carry an inherent risk: even a small change in the input can produce a significantly different output from the trained model. It is therefore crucial to evaluate the robustness of deep-learning models before we trust their decisions. In this paper, we evaluate the adversarial robustness of convolutional neural networks (CNNs), vision transformers (ViTs), and CNN + ViT hybrids, which are architectures commonly used in computer vision, using four new model-sensitivity metrics that we propose. These metrics are evaluated under random noise and gradient-based adversarial perturbations. For a fair comparison, models with similar capacities were used in each model group, and the experiments were conducted separately with ImageNet-1K and ImageNet-21K as the pretraining data. The experimental results show that ViTs are more robust than CNNs against gradient-based adversarial attacks, and our quantitative and qualitative analysis of these results brings to light the cause of the difference.
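The abstract does not spell out the proposed sensitivity metrics, but the general setup it describes, comparing a model's output change under a gradient-based perturbation versus random noise of the same budget, can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration (FGSM-style attack, ResNet-18, a softmax-shift proxy for sensitivity), not the paper's actual metrics or models.

```python
# Minimal sketch (not the paper's metrics): compare a classifier's output shift
# under an FGSM-style gradient perturbation vs. random noise of equal L-inf budget.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights=None).eval()      # placeholder ImageNet-style classifier
x = torch.rand(1, 3, 224, 224, requires_grad=True)  # placeholder input image
y = torch.tensor([0])                                # placeholder label
eps = 8 / 255                                        # perturbation budget (assumed)

# Gradient-based (FGSM-style) perturbation
loss = F.cross_entropy(model(x), y)
loss.backward()
x_adv = (x + eps * x.grad.sign()).clamp(0, 1)

# Random-noise perturbation with the same L-inf budget
x_rand = (x + eps * torch.sign(torch.randn_like(x))).clamp(0, 1)

with torch.no_grad():
    p_clean = F.softmax(model(x), dim=1)
    # Output shift as a simple sensitivity proxy (the paper defines its own metrics)
    adv_shift = (F.softmax(model(x_adv), dim=1) - p_clean).norm().item()
    rand_shift = (F.softmax(model(x_rand), dim=1) - p_clean).norm().item()

print(f"gradient-based shift: {adv_shift:.4f}  random-noise shift: {rand_shift:.4f}")
```

Running the same comparison with a ViT backbone (e.g. `torchvision.models.vit_b_16`) in place of the CNN is one way to reproduce the kind of CNN-versus-ViT sensitivity comparison the abstract describes.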
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~bo_han2
Submission Number: 668