Why Do Vision Transformers Have Better Adversarial Robustness than CNNs?

TMLR Paper668 Authors

06 Dec 2022 (modified: 28 Feb 2023) · Rejected by TMLR
Abstract: Deep-learning models perform remarkably well across many fields, owing to advances in computing power and the large-scale datasets used to train them. However, they carry an inherent risk: even a small change in the input can produce a significantly different output from the trained model. It is therefore crucial to evaluate the robustness of deep-learning models before we trust their decisions. In this paper, we evaluate the adversarial robustness of convolutional neural networks (CNNs), vision transformers (ViTs), and CNN + ViT hybrids, which are architectures commonly used in computer vision, using four new model-sensitivity metrics that we propose. These metrics are evaluated under random noise and gradient-based adversarial perturbations. For a fair comparison, models with similar capacities were used in each model group, and the experiments were conducted separately with ImageNet-1K and ImageNet-21K as the pretraining data. The experimental results show that ViTs are more robust than CNNs against gradient-based adversarial attacks, and our quantitative and qualitative analysis of these results brings to light the cause of the difference.
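The abstract does not spell out the proposed sensitivity metrics, but the general setup it describes, comparing a model's output change under a gradient-based perturbation versus random noise of the same budget, can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration (FGSM-style attack, ResNet-18, a softmax-shift proxy for sensitivity), not the paper's actual metrics or models.

```python
# Minimal sketch (not the paper's metrics): compare a classifier's output shift
# under an FGSM-style gradient perturbation vs. random noise of equal L-inf budget.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights=None).eval()      # placeholder ImageNet-style classifier
x = torch.rand(1, 3, 224, 224, requires_grad=True)  # placeholder input image
y = torch.tensor([0])                                # placeholder label
eps = 8 / 255                                        # perturbation budget (assumed)

# Gradient-based (FGSM-style) perturbation
loss = F.cross_entropy(model(x), y)
loss.backward()
x_adv = (x + eps * x.grad.sign()).clamp(0, 1)

# Random-noise perturbation with the same L-inf budget
x_rand = (x + eps * torch.sign(torch.randn_like(x))).clamp(0, 1)

with torch.no_grad():
    p_clean = F.softmax(model(x), dim=1)
    # Output shift as a simple sensitivity proxy (the paper defines its own metrics)
    adv_shift = (F.softmax(model(x_adv), dim=1) - p_clean).norm().item()
    rand_shift = (F.softmax(model(x_rand), dim=1) - p_clean).norm().item()

print(f"gradient-based shift: {adv_shift:.4f}  random-noise shift: {rand_shift:.4f}")
```

Running the same comparison with a ViT backbone (e.g. `torchvision.models.vit_b_16`) in place of the CNN is one way to reproduce the kind of CNN-versus-ViT sensitivity comparison the abstract describes.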
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~bo_han2
Submission Number: 668