Understanding MLP-Mixer as a wide and sparse MLP

19 Sept 2023 (modified: 11 Feb 2024) | Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: MLP-Mixer, structured weight matrix, wide neural network, Kronecker product
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: MLP-Mixer effectively behaves as a wide MLP with sparse weights
Abstract: The multi-layer perceptron (MLP) is a fundamental component of deep learning and has been employed for a wide variety of problems. However, recent empirical successes of MLP-based architectures, particularly the MLP-Mixer, suggest that our understanding of how MLPs achieve strong performance is still limited and that an underlying mechanism remains to be uncovered. In this work, we reveal that the MLP-Mixer effectively behaves as a wide MLP with sparse weights. First, we show that the mixing layer of the Mixer admits an effective expression as a wider MLP whose weights are sparse and represented by a Kronecker product; this expression can also be regarded as an approximation of Monarch matrices. Next, we confirm similarities between the Mixer and an MLP with unstructured sparse weights, in both hidden features and performance, when sparsity and width are adjusted. To verify this similarity at much larger widths, we introduce the RP-Mixer, a more memory-efficient alternative to the unstructured sparse-weight MLP. We then verify that the MLP-Mixer and the RP-Mixer exhibit similar tendencies, confirming that the MLP-Mixer behaves as a sparse and wide MLP and that its strong performance stems from its extreme width. Notably, when the number of connections is fixed and the width of the hidden layers is increased, sparsity rises and performance improves, consistent with the hypothesis of Golubeva, Neyshabur, and Gur-Ari (2021). In particular, maximizing the width allows us to quantitatively determine the optimal size of the mixing layers.
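The following is a minimal NumPy sketch, not the authors' code, illustrating the abstract's central claim: a token-mixing (or channel-mixing) layer acting on an input of shape (tokens, channels) realizes the same linear map as a Kronecker-structured, hence sparse, weight matrix acting on the flattened input. The names `S`, `C`, `W_token`, and `W_channel` are illustrative, and nonlinearities and skip connections are omitted.

```python
import numpy as np

S, C = 4, 3                        # number of tokens, number of channels
X = np.random.randn(S, C)          # input after patch embedding
W_token = np.random.randn(S, S)    # token-mixing weights (mix across tokens)
W_channel = np.random.randn(C, C)  # channel-mixing weights (mix across channels)

# Mixer view: apply the mixing weights along one axis of X.
Y_token = W_token @ X              # token mixing
Y_channel = X @ W_channel.T        # channel mixing

# Wide-MLP view: the same maps as Kronecker-product weights on vec(X).
# With column-major vectorization, vec(A X B) = (B.T kron A) vec(X).
x = X.flatten(order="F")                              # vec(X), length S*C
y_token = np.kron(np.eye(C), W_token) @ x             # == vec(W_token @ X)
y_channel = np.kron(W_channel, np.eye(S)) @ x         # == vec(X @ W_channel.T)

assert np.allclose(y_token, Y_token.flatten(order="F"))
assert np.allclose(y_channel, Y_channel.flatten(order="F"))

# The equivalent (S*C) x (S*C) weight matrices have only S*S*C (token mixing)
# and C*C*S (channel mixing) nonzeros, i.e. the "wide MLP" weight is highly sparse.
```

Under these assumptions, the width of the equivalent MLP layer is S*C while its number of parameters stays at S*S or C*C, which is the sparse-and-wide structure the abstract refers to.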
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1873