Improving MLP Module in Vision Transformer

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: vision transformer, MLP, efficient model design
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a novel MLP module for Vision Transformers, aiming to decrease FLOPs and parameters while maintaining classification accuracy.
Abstract: Transformer models have attracted substantial interest in computer vision. Although a vision transformer contains two important components, the self-attention module and the multi-layer perceptron (MLP) module, the majority of research concentrates on modifying the former while leaving the latter in its original form. In this paper, we focus on improving the MLP module within the vision transformer. Through theoretical analysis, we demonstrate that the effect of the MLP module primarily lies in providing non-linearity, whose degree corresponds to the number of hidden dimensions. Hence, the computational cost of the MLP module can be reduced by enhancing the degree of non-linearity of the nonlinear function itself. Leveraging this insight, we propose an improved MLP (IMLP) module for vision transformers that uses an arbitrary GeLU (AGeLU) function and combines multiple instances of it to augment non-linearity, so that the number of hidden dimensions can be effectively reduced. In addition, a spatial enhancement part is included to further enrich the non-linearity of the proposed IMLP module. Experimental results show that our method can be applied to a wide range of state-of-the-art vision transformer models, irrespective of how they modify their self-attention part and overall architecture, reducing FLOPs and parameters without compromising classification accuracy on the ImageNet dataset.
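For a concrete picture of the idea (the abstract alone does not fix the exact design), below is a minimal PyTorch sketch of what an IMLP-style block could look like. The AGeLU parameterization (learnable affine transforms around GELU), the number of parallel branches, and the depthwise convolution standing in for the spatial enhancement part are all illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AGeLU(nn.Module):
    """Hypothetical 'arbitrary GeLU': GELU wrapped in learnable affine
    transforms so each parallel branch can realize a different
    non-linearity. The paper's exact parameterization may differ."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.ones(1))
        self.b = nn.Parameter(torch.ones(1))
        self.c = nn.Parameter(torch.zeros(1))
        self.d = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return self.a * F.gelu(self.b * x + self.c) + self.d

class IMLP(nn.Module):
    """Sketch of the improved MLP: a narrower hidden layer whose
    non-linearity is enriched by several parallel AGeLU instances,
    with a depthwise 3x3 conv as a guess at the 'spatial enhancement'."""
    def __init__(self, dim, hidden_dim, num_branches=2):
        super().__init__()
        # hidden_dim is assumed smaller than the usual 4 * dim
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.branches = nn.ModuleList(AGeLU() for _ in range(num_branches))
        ch = hidden_dim * num_branches
        self.dw = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.fc2 = nn.Linear(ch, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N == H * W
        h = self.fc1(x)
        # concatenate branch outputs to widen non-linearity, not width
        h = torch.cat([branch(h) for branch in self.branches], dim=-1)
        B, N, C = h.shape
        h = h.transpose(1, 2).reshape(B, C, H, W)
        h = self.dw(h).reshape(B, C, N).transpose(1, 2)
        return self.fc2(h)
```

In this sketch, IMLP(dim=384, hidden_dim=384, num_branches=2) would stand in for a standard MLP whose hidden dimension is 4 x 384, trading hidden width for richer non-linearity; the width reduction and branch count actually used in the paper may differ.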
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4940