SP-ViT: Learning 2D Spatial Priors for Vision Transformers

16 Nov 2022 · OpenReview Archive Direct Upload
Abstract: Transformers have shown great potential in image classification and established state-of-the-art results on the ImageNet benchmark. In contrast to CNNs, which leverage the local correlation properties of image content, the spatial arrangement of an image is dissolved in transformers at the input level. As a result, vision transformers (ViTs) are initially unbiased in learning spatial relationships from data: nearby pixels have the same chance of interacting as far-away pixels, and complex relationships can be learned more easily. Yet, due to their large capacity, ViTs converge more slowly and are prone to overfitting in low-data regimes. To overcome this limitation, we propose Spatial Prior-enhanced Self-Attention (SP-SA), a novel variant of Self-Attention (SA) tailored for ViTs. Unlike convolutional inductive biases, which focus exclusively on hard-coded local regions, the proposed Spatial Priors are learned by the model itself and take a variety of complementary spatial relations into account. Experiments show that SP-SA consistently improves the performance of ViT models. We denote the resulting models SP-ViT. Taking the recently proposed vision transformer LV-ViT as an example, when equipped with SP-SA, the largest model achieves a record-equalling 86.1% Top-1 accuracy with nearly half the parameters (150M for SP-ViT-L↑384 vs. 271M for CaiT-M-36↑384) among all ImageNet-1K models trained at 224×224 and fine-tuned at 384×384 resolution without extra data.
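
The abstract describes SP-SA only at a high level: self-attention is augmented with spatial priors that the model learns itself rather than hard-coded local windows. As a rough illustration only, the minimal sketch below shows one common way such a mechanism can be realized, by adding learnable 2D relative-position biases to the attention logits. The module name, shapes, and the additive-bias form are assumptions made for illustration, not the paper's exact formulation, and the class token is omitted for brevity.

```python
# Hypothetical sketch of spatial-prior-augmented self-attention.
# Names, shapes, and the additive-bias formulation are assumptions,
# not the SP-SA mechanism as defined in the paper.
import torch
import torch.nn as nn


class SpatialPriorSelfAttention(nn.Module):
    """Self-attention whose logits are biased by learned 2D relative-position priors."""

    def __init__(self, dim, num_heads=8, grid_size=14):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per head and per 2D relative offset
        # ((2*G-1) x (2*G-1) possible offsets on a G x G token grid).
        self.rel_bias = nn.Parameter(torch.zeros(num_heads, (2 * grid_size - 1) ** 2))
        # Precompute, for every (query, key) token pair, the index of its relative offset.
        coords = torch.stack(torch.meshgrid(
            torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
        )).flatten(1)                                   # (2, N)
        rel = coords[:, :, None] - coords[:, None, :]   # (2, N, N)
        rel += grid_size - 1                            # shift offsets to be non-negative
        idx = rel[0] * (2 * grid_size - 1) + rel[1]     # (N, N) indices into rel_bias
        self.register_buffer("rel_idx", idx)

    def forward(self, x):
        # x: (B, N, C) with N == grid_size**2 patch tokens (no class token in this sketch).
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        # Add the learned 2D spatial prior to the attention logits.
        attn = attn + self.rel_bias[:, self.rel_idx].unsqueeze(0)
        attn = attn.softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(x)
```

An additive bias of this kind leaves the rest of the transformer block unchanged, which is consistent with the abstract's description of equipping an existing model such as LV-ViT with SP-SA.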
