Reviving Shift Equivariance in Vision Transformers

ICML 2023 Workshop SCIS Submission 49 Authors

Published: 20 Jun 2023, Last Modified: 28 Jul 2023
Venue: SCIS 2023 Poster
Keywords: machine learning, shift equivariance, vision transformers
Abstract: Shift equivariance, integral to object recognition, is often disrupted in Vision Transformers (ViT) by components like patch embedding, subsampled attention, and positional encoding. Attempts to combine convolutional neural network with ViTs are not fully successful in addressing this issue. We propose an input-adaptive polyphase anchoring algorithm for seamless integration into ViT models to ensure shift-equivariance. We also employ depth-wise convolution to encode positional information. Our algorithms enable ViT, and its variants such as Twins to achieve 100\% consistency with respect to input shift, demonstrate robustness to cropping, flipping, and affine transformations, and maintain consistent predictions even when the original models lose 20 percentage points on average when shifted by just a few pixels with Twins' accuracy dropping from 80.57\% to 62.40\%.
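The abstract names two components: input-adaptive polyphase anchoring applied before patch embedding, and a depth-wise convolution that supplies positional information. Below is a minimal PyTorch sketch of how such components might look; the function names, the L2-norm anchoring criterion, and the one-offset-per-batch simplification are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


def polyphase_anchor(x: torch.Tensor, patch: int) -> torch.Tensor:
    """Input-adaptive polyphase anchoring (illustrative sketch).

    Among the patch*patch possible grid offsets (polyphase components),
    pick the one with the largest L2 norm and roll the input so that it
    aligns with the patch grid. An input shift then only permutes patch
    tokens, so the subsequent patch embedding behaves shift-equivariantly.
    A faithful implementation would anchor each sample independently;
    this sketch picks a single offset for the whole batch.
    """
    best_norm, best_ij = -1.0, (0, 0)
    for i in range(patch):
        for j in range(patch):
            # Polyphase component: every `patch`-th pixel starting at (i, j).
            norm = x[:, :, i::patch, j::patch].pow(2).sum().item()
            if norm > best_norm:
                best_norm, best_ij = norm, (i, j)
    # Roll so the selected component lands on the patch grid origin.
    return torch.roll(x, shifts=(-best_ij[0], -best_ij[1]), dims=(2, 3))


class DepthwiseConvPE(nn.Module):
    """Positional encoding via depth-wise convolution (illustrative sketch)."""

    def __init__(self, dim: int, kernel: int = 3):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim)

    def forward(self, tokens: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # tokens: (B, N, C) with N = H * W; add conv features as positional cues.
        B, N, C = tokens.shape
        feat = tokens.transpose(1, 2).reshape(B, C, H, W)
        return tokens + self.dwconv(feat).flatten(2).transpose(1, 2)
```

In this sketch, `polyphase_anchor(img, patch=4)` would be applied to the image before the patch-embedding layer, while `DepthwiseConvPE` would stand in for absolute positional embeddings inside each transformer stage.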
Submission Number: 49