Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance in computer vision, yet they remain susceptible to adversarial examples. In this paper, we propose a novel adversarial attack method tailored for ViTs that leverages their inherent permutation invariance to generate highly transferable adversarial examples. Specifically, we split the image into patches at different scales and permute the local patches to generate diverse inputs. By optimizing perturbations over the permuted image set, we prevent the generated adversarial examples from overfitting to the surrogate model, thereby enhancing transferability. Extensive experiments on ImageNet demonstrate that the permutation-invariant (PI) attack significantly improves transferability both among ViTs and from ViTs to CNNs. PI is applicable to diverse ViT architectures and can be seamlessly integrated with existing attack methods to further enhance transferability. Our approach surpasses state-of-the-art input-transformation ensemble methods and achieves a notable performance improvement of 11.9% on average.
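To make the patch-permutation idea concrete, the following is a minimal sketch, not the authors' implementation: it randomly permutes non-overlapping patches of an image at a chosen scale, producing the kind of diverse permuted views over which adversarial perturbations could be optimized jointly. The function name `permute_patches`, the chosen patch sizes, and the number of views are illustrative assumptions.

```python
# Illustrative sketch (assumed helper, not the paper's code): randomly permute
# non-overlapping image patches at a given scale to build diverse input views.
import torch


def permute_patches(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Randomly permute non-overlapping patches of each image.

    images: (B, C, H, W) tensor; H and W must be divisible by patch_size.
    """
    b, c, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size
    # Split each image into a grid of patches: (B, gh*gw, C, ps, ps).
    patches = (
        images.view(b, c, gh, patch_size, gw, patch_size)
        .permute(0, 2, 4, 1, 3, 5)
        .reshape(b, gh * gw, c, patch_size, patch_size)
    )
    # Independent random permutation of patch positions for each image.
    perm = torch.argsort(torch.rand(b, gh * gw, device=images.device), dim=1)
    idx = perm.view(b, gh * gw, 1, 1, 1).expand_as(patches)
    patches = torch.gather(patches, 1, idx)
    # Reassemble the permuted patches back into full images.
    return (
        patches.view(b, gh, gw, c, patch_size, patch_size)
        .permute(0, 3, 1, 4, 2, 5)
        .reshape(b, c, h, w)
    )


# Example: a small set of permuted views at two assumed patch scales, over
# which a perturbation could be optimized to reduce surrogate overfitting.
x = torch.rand(4, 3, 224, 224)
views = [permute_patches(x, s) for s in (16, 32) for _ in range(2)]
```

Because ViT predictions are largely invariant to such patch reorderings, averaging the attack objective over these views plausibly discourages perturbations that exploit surrogate-specific positional cues.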