Stochastic Layer-Wise Shuffle: A Good Practice to Improve Vision Mamba Training

Zizheng Huang; Haoxing Chen; Jiaqi Li; jun lan; Huijia Zhu; Weiqiang Wang; Limin Wang

Stochastic Layer-Wise Shuffle: A Good Practice to Improve Vision Mamba Training

Zizheng Huang, Haoxing Chen, Jiaqi Li, jun lan, Huijia Zhu, Weiqiang Wang, Limin Wang

23 Sept 2024 (modified: 13 Nov 2024)ICLR 2025 Conference Withdrawn SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Vision Mamba, Supervised Learning, Training Regularization, Computer Vision

TL;DR: We propose a layer-wise stochastic regularization method that enables effective scaling of vanilla Vision Mamba model training to larger sizes, achieving SOTA performance on vision tasks compared to various architectures.

Abstract: Recent Vision Mamba models not only have much lower complexity for processing higher resolution images and longer videos but also the competitive performance with Vision Transformers (ViTs). However, they are stuck into overfitting and thus only present up to base size (about 80M). It is still unclear how vanilla Vision Mamba (Vim) can be efficiently scaled up to larger sizes, which is essentially for further exploitation. In this paper, we propose a stochastic layer-wise shuffle regularization, which empowers successfully scaling non-hierarchical Vision Mamba to a large size (about 300M) in a supervised setting. Specifically, our base and large-scale ShuffleMamba models can outperform the supervised ViTs of similar size by 0.8% and 1.0% classification accuracy on ImageNet1k, respectively, without auxiliary data. When evaluated on the ADE20K semantic segmentation and COCO detection tasks, our ShuffleMamba models also show significant improvements. Without bells and whistles, the stochastic layer-wise shuffle has the following highlights: (1) Plug and play: it does not change model architectures and will be omitted in inference. (2) Simple but effective: it can improve the overfitting in Vim training and only introduce random token permutation operations. (3) Intuitive: the token sequences in deeper layers are more likely to be shuffled as they are expected to be more semantic and less sensitive to patch positions.

Supplementary Material: zip

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 3011

Loading