Abstract: Following the success in advancing natural language processing and understanding, transformers are expected to bring revolutionary changes to computer vision. This work provides a comprehensive study on both empirical and certified robustness of vision transformers (ViTs), with analysis that casts light on creating models that resist adversarial attacks. We find that ViTs possess better empirical and certified adversarial robustness when compared with various baselines. In our frequency study, we show features learned by ViTs contain less high-frequency patterns which tend to have spurious correlation, and there is a high correlation between how much the model learns high-frequency features and its robustness against different frequency-based perturbations. Moreover, modern CNN designs that borrow techniques from ViTs including activation function, layer norm, larger kernel size to imitate the global attention, and patchify the images as inputs, etc., could help bridge the performance gap between ViTs and CNNs not only in terms of performance, but also certified and empirical adversarial robustness. Introducing convolutional or tokens-to-token blocks for learning high-frequency features in ViTs can improve classification accuracy but at the cost of adversarial robustness.
1 Reply
Loading