TL;DR: A plug-and-play training regularization that improves non-hierarchical Vision Mamba training and is deactivated at inference.
Abstract: Recent Vision Mamba (Vim) models exhibit nearly linear complexity in sequence length, making them highly attractive for processing visual data. However, their training methodologies and potential remain insufficiently explored. In this paper, we investigate training strategies for Vim and propose Stochastic Layer-Wise Shuffle (SLWS), a novel regularization method that effectively improves Vim training. Without architectural modifications, this approach enables non-hierarchical Vim models to achieve leading performance on ImageNet-1K compared with counterparts of the same type. Our method operates through four simple steps per layer: probability allocation to assign layer-dependent shuffle rates, operation sampling via Bernoulli trials, sequence shuffling of input tokens, and order restoration of outputs. SLWS distinguishes itself through three principles: (1) Plug-and-play: no architectural modifications are needed, and it is deactivated during inference. (2) Simple but effective: the four-step process introduces only random permutations and negligible overhead. (3) Intuitive design: shuffling probabilities grow linearly with layer depth, aligning with the hierarchical semantic abstraction in vision models. Our work underscores the importance of tailored training strategies for Vim models and provides a helpful way to explore their scalability. Code and models are available at https://github.com/huangzizheng01/ShuffleMamba
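The four steps above map naturally onto a small per-layer routine. Below is a minimal PyTorch-style sketch under stated assumptions: the function names, the `p_max` cap, the exact linear schedule `p_max * (l + 1) / L`, and the choice of one shared permutation per batch are illustrative placeholders, not the authors' exact implementation (see the linked repository for that).

```python
import torch

def stochastic_layer_wise_shuffle(x, layer_idx, num_layers, p_max=0.5, training=True):
    """Sketch of SLWS for one layer.

    x: (batch, seq_len, dim) token sequence.
    Returns the (possibly shuffled) sequence and the permutation used,
    or None when no shuffle was applied.
    """
    if not training:
        return x, None  # plug-and-play: deactivated at inference
    # (1) Probability allocation: shuffle rate grows linearly with depth
    #     (assumed schedule; deeper layers are shuffled more often).
    p = p_max * (layer_idx + 1) / num_layers
    # (2) Operation sampling: a Bernoulli trial decides whether this
    #     layer shuffles on this forward pass.
    if torch.rand(1).item() >= p:
        return x, None
    # (3) Sequence shuffling: random permutation along the token axis
    #     (shared across the batch here, an illustrative simplification).
    perm = torch.randperm(x.size(1), device=x.device)
    return x[:, perm, :], perm

def restore_order(y, perm):
    # (4) Order restoration: invert the permutation on the layer output
    #     so downstream layers see tokens in their original positions.
    if perm is None:
        return y
    inv = torch.argsort(perm)
    return y[:, inv, :]
```

In use, a Vim block would call `stochastic_layer_wise_shuffle` on its input, run its usual token mixing, then call `restore_order` on the output; since both functions only permute indices, the overhead is negligible and the regularizer vanishes entirely when `training=False`.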
Lay Summary: Large images slow down most AI models because their running time grows faster than the image size. Vision Mamba stays fast because its running time grows almost linearly with image size. Yet it runs into problems when trained at larger scales. We address this with a training method called Stochastic Layer-Wise Shuffle. It costs nothing at test time and needs no changes to the network.
Primary Area: Deep Learning->Other Representation Learning
Keywords: Vision Mamba, Image Modeling, Training Regularization, Pre-training
Submission Number: 6750