Keywords: group equivariant neural network, vision transformer, position encoding
TL;DR: We prove that previous attempts at designing group-equivariant ViTs are not effective in some cases, and we address this with a novel, effective equivariant positional encoding.
Abstract: Vision Transformer (ViT) has achieved remarkable performance in computer vision. However, positional encoding in ViT makes it substantially harder to learn the intrinsic equivariance in data. Initial attempts have been made at designing equivariant ViTs, but we prove in this paper that they are defective in some cases. To address this issue, we design a Group Equivariant Vision Transformer (GE-ViT) via a novel, effective positional encoding operator. We prove that GE-ViT meets all the theoretical requirements of an equivariant neural network. Comprehensive experiments are conducted on standard benchmark datasets, demonstrating that GE-ViT significantly outperforms non-equivariant self-attention networks. The code is available at https://github.com/ZJUCDSYangKaifan/GEVit.
Supplementary Material: pdf
Other Supplementary Material: zip