TL;DR: A plug-and-play training regularization that improves non-hierarchical Vision Mamba training and is deactivated at inference.
Abstract: Recent Vision Mamba (Vim) models exhibit nearly linear complexity in sequence length, making them highly attractive for processing visual data. However, their training methodologies and potential remain insufficiently explored. In this paper, we investigate training strategies for Vim and propose Stochastic Layer-Wise Shuffle (SLWS), a novel regularization method that effectively improves Vim training. Without architectural modifications, this approach enables non-hierarchical Vim models to achieve leading performance on ImageNet-1K compared with counterparts of the same type. Our method operates through four simple steps per layer: probability allocation to assign layer-dependent shuffle rates, operation sampling via Bernoulli trials, sequence shuffling of input tokens, and order restoration of outputs. SLWS distinguishes itself through three principles: (1) Plug-and-play: no architectural modifications are needed, and it is deactivated during inference. (2) Simple but effective: the four-step process introduces only random permutations and negligible overhead. (3) Intuitive design: shuffling probabilities grow linearly with layer depth, aligning with the hierarchical semantic abstraction in vision models. Our work underscores the importance of tailored training strategies for Vim models and provides a helpful way to explore their scalability. Code and models are available at https://github.com/huangzizheng01/ShuffleMamba
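The four steps above map naturally onto a small per-layer routine. Below is a minimal PyTorch-style sketch under stated assumptions: the function names, the `p_max` cap, the exact linear schedule `p_max * (l + 1) / L`, and the choice of one shared permutation per batch are illustrative placeholders, not the authors' exact implementation (see the linked repository for that).

```python
import torch

def stochastic_layer_wise_shuffle(x, layer_idx, num_layers, p_max=0.5, training=True):
    """Sketch of SLWS for one layer.

    x: (batch, seq_len, dim) token sequence.
    Returns the (possibly shuffled) sequence and the permutation used,
    or None when no shuffle was applied.
    """
    if not training:
        return x, None  # plug-and-play: deactivated at inference
    # (1) Probability allocation: shuffle rate grows linearly with depth
    #     (assumed schedule; deeper layers are shuffled more often).
    p = p_max * (layer_idx + 1) / num_layers
    # (2) Operation sampling: a Bernoulli trial decides whether this
    #     layer shuffles on this forward pass.
    if torch.rand(1).item() >= p:
        return x, None
    # (3) Sequence shuffling: random permutation along the token axis
    #     (shared across the batch here, an illustrative simplification).
    perm = torch.randperm(x.size(1), device=x.device)
    return x[:, perm, :], perm

def restore_order(y, perm):
    # (4) Order restoration: invert the permutation on the layer output
    #     so downstream layers see tokens in their original positions.
    if perm is None:
        return y
    inv = torch.argsort(perm)
    return y[:, inv, :]
```

In use, a Vim block would call `stochastic_layer_wise_shuffle` on its input, run its usual token mixing, then call `restore_order` on the output; since both functions only permute indices, the overhead is negligible and the regularizer vanishes entirely when `training=False`.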
Lay Summary: Large images slow down most AI models because their running time grows faster than the image size. Vision Mamba stays fast because its running time grows almost linearly with image size. Yet it runs into problems when trained at larger scales. We address this with a training method called Stochastic Layer-Wise Shuffle. It costs nothing at test time and needs no changes to the network.
Primary Area: Deep Learning->Other Representation Learning
Keywords: Vision Mamba, Image Modeling, Training Regularization, Pre-training
Submission Number: 6750