Keywords: group equivariant neural network, vision transformer, position encoding
TL;DR: We prove that previous attempts at designing group-equivariant ViTs are not effective in some cases, and we address this with a novel, effective equivariant positional encoding.
Abstract: Vision Transformer (ViT) has achieved remarkable performance in computer vision. However, positional encoding in ViT makes it substantially more difficult to learn the intrinsic equivariance in data. Initial attempts have been made at designing equivariant ViTs, but in this paper we prove that they are defective in some cases. To address this issue, we design a Group Equivariant Vision Transformer (GE-ViT) via a novel, effective positional encoding operator. We prove that GE-ViT meets all the theoretical requirements of an equivariant neural network. Comprehensive experiments on standard benchmark datasets demonstrate that GE-ViT significantly outperforms non-equivariant self-attention networks. The code is available at https://github.com/ZJUCDSYangKaifan/GEVit.
Supplementary Material: pdf
Other Supplementary Material: zip
Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 2 code implementations](https://www.catalyzex.com/paper/e-equivariant-vision-transformer/code)