O-ViT: Orthogonal Vision Transformer

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Abstract: Inspired by the tremendous success of the self-attention mechanism in natural language processing, the Vision Transformer (ViT) creatively applies it to sequences of image patches and achieves remarkable performance. However, ViT suffers from feature redundancy and low utilization of model capacity. To address this problem, we propose a novel and effective method, the Orthogonal Vision Transformer (O-ViT), which optimizes ViT from a geometric perspective. O-ViT constrains the parameters of the self-attention blocks to lie on the orthogonal manifold, which reduces the similarity between trainable parameters and yields more distinctive features. Moreover, O-ViT enforces the orthogonal constraint with negligible optimization overhead by adopting a surjective mapping from the Lie algebra onto the orthogonal group. Comparative experiments on various image recognition tasks demonstrate the validity of O-ViT: the results show that O-ViT can boost the performance of ViT by up to 6.4%.
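The abstract only sketches the mechanism, so the following is a minimal illustrative sketch (not the authors' code) of one standard way to realize such a constraint: a square projection weight is parametrized as the matrix exponential of a skew-symmetric matrix, i.e. an element of the Lie algebra so(n), so the resulting weight always lies on the orthogonal group. The class name `OrthogonalLinear` and its placement inside an attention block are assumptions for illustration.

```python
# Minimal sketch, assuming an exponential-map parametrization of orthogonal
# weights; this is NOT the authors' released implementation.
import torch
import torch.nn as nn


class OrthogonalLinear(nn.Module):
    """Linear layer whose square weight is kept orthogonal by construction."""

    def __init__(self, dim: int, bias: bool = True):
        super().__init__()
        # Unconstrained parameter; only its skew-symmetric part is used.
        self.raw = nn.Parameter(torch.randn(dim, dim) * 0.02)
        self.bias = nn.Parameter(torch.zeros(dim)) if bias else None

    def weight(self) -> torch.Tensor:
        # A - A^T is skew-symmetric (an element of the Lie algebra so(n));
        # its matrix exponential is orthogonal (in fact lies in SO(n)).
        skew = self.raw - self.raw.transpose(-1, -2)
        return torch.linalg.matrix_exp(skew)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = x @ self.weight().transpose(-1, -2)
        return out + self.bias if self.bias is not None else out


# Usage: a hypothetical drop-in for a query projection of a ViT attention
# block (the paper's exact placement of the constraint may differ).
proj_q = OrthogonalLinear(dim=192)
x = torch.randn(8, 197, 192)              # (batch, tokens, embed dim)
q = proj_q(x)
W = proj_q.weight()
print(torch.allclose(W @ W.T, torch.eye(192), atol=1e-5))  # orthogonality check
```

Because the weight is a smooth function of the unconstrained parameter, ordinary gradient-based optimizers can be used directly, which is consistent with the abstract's claim of negligible optimization overhead.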
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
Supplementary Material: zip