Architecturally Aligned Comparisons Between ConvNets And Vision Mambas

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Architecturally aligned comparisons, Vision Mambas, ConvNets, Vision Transformers
TL;DR: We conduct architecturally aligned comparisons between ConvNets and Vision Mambas, providing credible evidence for the necessity of introducing Mamba to vision.
Abstract: Mamba, an architecture whose token mixers are state space models (SSMs), has recently been introduced to vision tasks to tackle the quadratic complexity of self-attention. However, since an SSM's memory is inherently lossy and prior vision Mambas struggle to compete with advanced ConvNets or ViTs, it is unclear whether Mamba has contributed new advances to vision. In this work, we carefully align the macro architecture to enable direct comparisons of token mixers, which are the core contribution of Mamba. Specifically, we construct a series of Gated ConvNets (GConvNets) and compare VMamba's token mixers with gated 7$\times$7 depth-wise convolutions. The empirical results clearly demonstrate the superiority of VMamba's token mixers in both image classification and object detection. Introducing SSMs is therefore not without benefit, even for image classification on ImageNet. Furthermore, we compare the two types of token mixers within hybrid architectures that incorporate a few self-attention layers in the top blocks. The results show that both VMambas and GConvNets benefit from self-attention, and Mamba is still needed in this case. Interestingly, incorporating self-attention layers has opposite effects on the two: it mitigates over-fitting in VMambas while enhancing the fitting ability of GConvNets. Finally, we assess the natural robustness of the pure and hybrid models in image classification, revealing stronger robustness for VMambas and the hybrid models. Our work provides credible evidence for the necessity of introducing Mamba to vision and shows the significance of architecturally aligned comparisons for evaluating different token mixers in sophisticated hierarchical models.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6240
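
For illustration only, below is a minimal PyTorch sketch of a gated 7×7 depth-wise-convolution token mixer of the kind the abstract attributes to the GConvNet baseline. The module name GatedDWConvMixer, the channels-last layout, the SiLU activation, and the 2× input projection are assumptions made for this sketch, not the authors' implementation; in an architecturally aligned comparison, only this mixer would be swapped for VMamba's SSM-based mixer while the surrounding macro architecture stays fixed.

```python
# Hypothetical sketch of a gated 7x7 depth-wise-convolution token mixer
# (GConvNet-style). Structure and hyperparameters are illustrative
# assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn


class GatedDWConvMixer(nn.Module):
    """Value branch mixed spatially by a 7x7 depth-wise conv, modulated by a gate branch."""

    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        # Project the input into a value branch and a gate branch.
        self.proj_in = nn.Linear(dim, 2 * dim)
        # Depth-wise convolution performs the spatial token mixing.
        self.dwconv = nn.Conv2d(
            dim, dim, kernel_size=kernel_size,
            padding=kernel_size // 2, groups=dim,
        )
        self.act = nn.SiLU()
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C), channels-last as in many VMamba-style blocks.
        v, g = self.proj_in(x).chunk(2, dim=-1)
        # Channels-first for the depth-wise conv, then back to channels-last.
        v = self.dwconv(v.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        # Gating: element-wise modulation of the spatially mixed values.
        return self.proj_out(self.act(g) * v)


if __name__ == "__main__":
    mixer = GatedDWConvMixer(dim=96)
    tokens = torch.randn(2, 14, 14, 96)   # (batch, height, width, channels)
    print(mixer(tokens).shape)             # torch.Size([2, 14, 14, 96])
```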