A Few Adversarial Tokens Can Break Vision Transformers

TMLR Paper1043 Authors

07 Apr 2023 (modified: 17 Sept 2024) · Rejected by TMLR · CC BY 4.0
Abstract: Vision transformers rely on self-attention operations between disjoint patches (tokens) of an input image, in contrast with standard convolutional networks. We investigate fundamental differences between the adversarial robustness properties of these two families of models when subjected to adversarial token attacks, i.e., attacks in which an adversary can modify only a tiny subset of input tokens. We subject various transformer and convolutional models to token attacks of varying patch sizes. Our results show that vision transformer models are much more sensitive to token attacks than the current best convolutional models, with SWIN outperforming other transformer models by up to $\sim20\%$ in robust accuracy under single-token attacks. We also show that popular vision-language models such as CLIP are even more vulnerable to token attacks. Finally, we demonstrate that a simple architectural operation (shifted windowing), used by transformer variants such as SWIN, significantly enhances robustness to token attacks, and that using SWIN as a backbone for vision-language models likewise improves their robustness. Our evaluation therefore suggests that SWIN backbones or BEiT-style pretraining yield models that are more robust to token attacks.
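
For concreteness, the sketch below shows one way such a token attack could be instantiated: signed gradient ascent on the classification loss, with the perturbation masked so that only a single 16x16 patch (one token) of the image is modified. This is not the authors' attack code; the function name, patch coordinates, step size, and iteration count are illustrative assumptions, and the model stands in for any PyTorch image classifier.

# Minimal sketch (not from the paper) of a single-token (patch) attack in PyTorch.
# The perturbation is restricted to one patch via a binary mask and optimized by
# signed gradient ascent on the cross-entropy loss. Hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def single_token_attack(model, image, label, patch_row, patch_col,
                        patch_size=16, steps=50, step_size=0.05):
    # image: [1, 3, H, W] tensor in [0, 1]; label: [1] tensor of class indices.
    model.eval()
    mask = torch.zeros_like(image)
    r, c = patch_row * patch_size, patch_col * patch_size
    mask[:, :, r:r + patch_size, c:c + patch_size] = 1.0  # attack one token only

    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        logits = model(torch.clamp(image + delta * mask, 0.0, 1.0))
        loss = F.cross_entropy(logits, label)  # maximize loss on the true label
        loss.backward()
        with torch.no_grad():
            delta += step_size * delta.grad.sign()  # unconstrained inside the patch
        delta.grad.zero_()
    return torch.clamp(image + delta.detach() * mask, 0.0, 1.0)

# Hypothetical usage: adv = single_token_attack(vit_model, x, y, patch_row=3, patch_col=5)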
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Changyou_Chen1
Submission Number: 1043