gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window

Published: 01 Jan 2023 · Last Modified: 09 Aug 2024 · ICASSP 2023 · License: CC BY-SA 4.0
Abstract: Following its success in the language domain, the self-attention mechanism (Transformer) has been adopted in the vision domain, where it has recently achieved great success. The use of multi-layer perceptrons (MLPs) has also been explored in the vision domain as another stream. These architectures have been attracting attention as alternatives to traditional CNNs, and many Vision Transformers and Vision MLPs have been proposed. Fusing the two streams, this paper proposes gSwin, a novel vision model that captures spatial hierarchy and locality through a network structure similar to the Swin Transformer, and is parameter-efficient due to its gated-MLP-based architecture. Experiments confirm that gSwin achieves better accuracy than the Swin Transformer on three common vision tasks with a smaller model size.
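To make the fusion of the two streams concrete, below is a minimal PyTorch sketch of the core idea the abstract describes: a gMLP-style spatial gating unit applied within (optionally shifted) local windows, in place of Swin's window attention. The class names, `mlp_ratio`, and the omission of Swin's attention masking for shifted windows are illustrative assumptions, not the authors' implementation; consult the paper for the actual architecture.

```python
import torch
import torch.nn as nn


class SpatialGatingUnit(nn.Module):
    """gMLP-style gating: split channels in half, gate one half with a
    learned linear mixing over the tokens of the other half."""

    def __init__(self, dim, num_tokens):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        # Linear projection along the token (spatial) axis within a window.
        self.spatial_proj = nn.Linear(num_tokens, num_tokens)
        # Near-identity initialization, as in the gMLP paper.
        nn.init.zeros_(self.spatial_proj.weight)
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, x):  # x: (B * num_windows, N, dim)
        u, v = x.chunk(2, dim=-1)
        v = self.norm(v)
        v = self.spatial_proj(v.transpose(-1, -2)).transpose(-1, -2)
        return u * v


class WindowGMLPBlock(nn.Module):
    """One block: partition the feature map into (shifted) windows and
    apply a gated MLP within each window. Swin's masking of cross-boundary
    interactions after the cyclic shift is omitted here for brevity."""

    def __init__(self, dim, window_size=7, shift=0, mlp_ratio=2):
        super().__init__()
        self.window_size, self.shift = window_size, shift
        hidden = dim * mlp_ratio
        self.norm = nn.LayerNorm(dim)
        self.proj_in = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.sgu = SpatialGatingUnit(hidden, window_size * window_size)
        self.proj_out = nn.Linear(hidden // 2, dim)

    def forward(self, x):  # x: (B, H, W, C), H and W divisible by window_size
        B, H, W, C = x.shape
        shortcut = x
        if self.shift:  # cyclic shift, as in the Swin Transformer
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        ws = self.window_size
        # Partition into non-overlapping ws x ws windows.
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        x = self.proj_out(self.sgu(self.act(self.proj_in(self.norm(x)))))
        # Reverse the window partition.
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        if self.shift:
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        return shortcut + x


if __name__ == "__main__":
    x = torch.randn(2, 56, 56, 96)  # (B, H, W, C); 56 is divisible by 7
    blk = WindowGMLPBlock(dim=96, window_size=7, shift=3)
    print(blk(x).shape)  # torch.Size([2, 56, 56, 96])
```

As in Swin, alternating blocks would use `shift=0` and `shift=window_size // 2` so information propagates across window boundaries; the hierarchy would come from interleaving such blocks with patch-merging downsampling stages.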
