Are Vision Transformers Robust to Patch-wise Perturbations?

29 Sept 2021 (modified: 13 Feb 2023), ICLR 2022 Conference Withdrawn Submission
Keywords: Vision Transformer, Adversarial Robustness
Abstract: Recent advances in Vision Transformers (ViTs) have demonstrated impressive performance in image classification, making them a promising alternative to Convolutional Neural Networks (CNNs). Unlike CNNs, a ViT represents an input image as a sequence of image patches. This patch-wise input representation raises an interesting question: compared to CNNs, how does a ViT perform when individual input patches are perturbed with natural corruptions or adversarial perturbations? In this work, we conduct a comprehensive study of the robustness of vision transformers to patch-wise perturbations. Surprisingly, we find that vision transformers are more robust than CNNs to naturally corrupted patches, yet more vulnerable to adversarial patches. Based on extensive qualitative and quantitative experiments, we show that both the stronger robustness of ViTs to naturally corrupted patches and their higher vulnerability to adversarial patches stem from the attention mechanism. Specifically, attention improves robustness by effectively ignoring naturally corrupted patches. Under adversarial attack, however, the attention mechanism can be easily fooled into focusing on the adversarially perturbed patches, leading to misclassification.
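
To make the studied setting concrete, below is a minimal sketch (not the authors' released code) of the two patch-wise perturbations contrasted in the abstract: Gaussian noise applied to a single input patch as a natural corruption, and an adversarial patch optimized with projected gradient descent (PGD) restricted to the same region. The classifier `model`, the patch size of 16 (matching ViT-B/16), and the PGD hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

PATCH = 16  # side length of one ViT input patch (assumption: ViT-B/16)

def patch_mask(h, w, row, col, patch=PATCH, device="cpu"):
    """Binary mask: 1 inside the chosen patch, 0 elsewhere."""
    mask = torch.zeros(1, 1, h, w, device=device)
    mask[..., row * patch:(row + 1) * patch, col * patch:(col + 1) * patch] = 1.0
    return mask

def corrupt_patch(x, row, col, std=0.5):
    """Natural corruption: Gaussian noise on one patch of x (B, 3, H, W in [0, 1])."""
    mask = patch_mask(x.shape[-2], x.shape[-1], row, col, device=x.device)
    return (x + std * mask * torch.randn_like(x)).clamp(0.0, 1.0)

def adv_patch(model, x, y, row, col, steps=40, alpha=2 / 255):
    """Adversarial patch: untargeted PGD on the pixels of a single patch only."""
    mask = patch_mask(x.shape[-2], x.shape[-1], row, col, device=x.device)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model((x + mask * delta).clamp(0.0, 1.0)), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # ascend the classification loss
            delta.grad.zero_()
    return (x + mask * delta.detach()).clamp(0.0, 1.0)
```

For a 224x224 input with 16x16 patches, `row` and `col` index a 14x14 grid. The comparison described in the abstract then roughly amounts to measuring how classification accuracy drops under `corrupt_patch` versus `adv_patch` for a ViT and a CNN with the same single-patch budget.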
One-sentence Summary: Vision transformers are more robust than CNNs to natural patch corruptions, but more vulnerable to adversarial patch attacks.