Keywords: transformer, self-attention, image classification, video classification, correlation
TL;DR: We introduce structural self-attention (StructSA), which exploits geometric structures of query-key correlations; the proposed network, StructViT, achieves state-of-the-art results on various image and video classification benchmarks.
Abstract: We introduce the structural self-attention (StructSA) mechanism that leverages structural patterns of query-key correlations for visual representation learning. StructSA generates attention by recognizing space-time structures of correlations and performs long-range interactions across all locations, effectively capturing structural patterns, e.g., spatial layouts, motion, or inter-object relations in images and videos. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.
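To make the abstract's idea concrete, here is a minimal, illustrative sketch of one way attention could be generated from the spatial structure of query-key correlation maps rather than from raw similarities. This is an assumption-laden reading of the abstract, not the authors' implementation: the class name `StructSASketch`, the single-head setup, the channel count, and the use of a small 2D convolution to "recognize" structures in each query's correlation map are all hypothetical choices made for illustration.

```python
# A hedged sketch, assuming a single attention head over an h x w token grid.
# NOT the paper's StructSA; only the high-level idea from the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructSASketch(nn.Module):
    def __init__(self, dim, corr_channels=8):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Hypothetical: a small conv that scans each query's 2D correlation
        # map, turning raw similarities into structure-aware attention logits.
        self.struct = nn.Conv2d(1, corr_channels, kernel_size=3, padding=1)
        self.mix = nn.Conv2d(corr_channels, 1, kernel_size=1)
        self.scale = dim ** -0.5

    def forward(self, x, h, w):
        # x: (B, N, C) tokens laid out on an h x w grid, with N == h * w.
        B, N, C = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        corr = (q @ k.transpose(-2, -1)) * self.scale       # (B, N, N)
        # View each query's correlations as a 2D map and detect local
        # geometric patterns (e.g., layout- or motion-like structures).
        maps = corr.reshape(B * N, 1, h, w)
        logits = self.mix(F.relu(self.struct(maps)))        # (B*N, 1, h, w)
        attn = logits.reshape(B, N, N).softmax(dim=-1)
        return attn @ v                                     # (B, N, C)

# Example usage on a 7x7 grid of 64-dim tokens:
# out = StructSASketch(64)(torch.randn(2, 49, 64), 7, 7)  # -> (2, 49, 64)
```

The key departure from plain self-attention in this sketch is that attention weights are a learned function of each correlation map's spatial arrangement, not just its pointwise values; extending the conv to 3D would analogously cover the space-time (video) case the abstract mentions.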
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)