Abstract: Understanding the adversarial robustness of Vision Transformers (ViTs) is important because the vulnerability of neural networks to adversarial perturbations hinders their deployment. We present an approach that decomposes the network into submodules and computes the maximal singular value of each submodule with respect to its input, which serves as a good indicator of adversarial robustness. To investigate whether Multi-head Self-Attention (MSA) in ViTs contributes to adversarial robustness, we use this decomposition to replace the MSA modules with convolutional layers and conclude that MSA has limited power to defend against adversarial attacks.
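The abstract does not spell out how the per-module maximal singular value is computed. A common way to estimate it, interpreting it as the spectral norm of the submodule's input-output Jacobian at a given input, is power iteration with Jacobian-vector and vector-Jacobian products. The sketch below is only a minimal illustration under that assumption; the function `max_singular_value`, the iteration count, and the use of PyTorch's `torch.autograd.functional` utilities are assumptions, not the paper's released code.

```python
import torch
from torch.autograd.functional import jvp, vjp

def max_singular_value(module, x, n_iters=20):
    """Estimate the largest singular value of the Jacobian of `module`
    at input `x` via power iteration on J^T J (illustrative sketch)."""
    module.eval()
    f = lambda inp: module(inp)

    v = torch.randn_like(x)          # random probe direction
    v = v / v.norm()
    for _ in range(n_iters):
        _, u = jvp(f, x, v)          # u = J v   (Jacobian-vector product)
        _, w = vjp(f, x, u)          # w = J^T u (vector-Jacobian product)
        v = w / (w.norm() + 1e-12)   # renormalize the probe direction

    _, u = jvp(f, x, v)
    return u.norm().item()           # sigma_max ≈ ||J v|| since ||v|| = 1
```

Applied submodule by submodule (e.g., to each attention block or to a convolutional replacement), such an estimate gives the per-module sensitivity that the abstract refers to as an indicator of adversarial robustness.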