Keywords: State Space Model, Mamba, Token Reduction, Efficient AI
Abstract: Mamba has proven efficient for long-sequence modeling in vision tasks. However, when token reduction techniques are applied to improve efficiency, Mamba-based models exhibit severe performance degradation compared to Vision Transformers (ViTs). We attribute this decline to Mamba's chain-like scanning mechanism, which we hypothesize not only induces cascading losses in token connectivity but also limits the diversity of spatial receptive fields.
In this paper, we propose Asymmetric Multi-scale Vision Mamba (AMVim), a novel architecture designed to enhance pruning robustness. AMVim employs a dual-path structure, integrating a window-aware scanning mechanism into one path while retaining sequential scanning in the other. This asymmetric design promotes diversity in token connections and enables multi-scale information flow, reinforcing spatial awareness.
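The abstract does not give implementation details for the two scanning paths. As a minimal illustrative sketch (all names hypothetical, not the authors' code), the asymmetry can be understood as the two paths flattening the same 2-D token grid into different 1-D orders: one raster (sequential) order, and one that groups tokens window by window so spatially close tokens stay adjacent in the scanned sequence:

```python
def sequential_scan_order(h, w):
    """Row-major (raster) scan over an h x w token grid:
    the sequential path's 1-D ordering."""
    return [(i, j) for i in range(h) for j in range(w)]

def window_scan_order(h, w, win=2):
    """Window-aware scan: visit the grid window by window, so tokens
    in the same local window are consecutive in the 1-D sequence.
    `win` (window size) is an illustrative choice, not from the paper."""
    order = []
    for wi in range(0, h, win):
        for wj in range(0, w, win):
            for i in range(wi, min(wi + win, h)):
                for j in range(wj, min(wj + win, w)):
                    order.append((i, j))
    return order

# On a 4x4 grid, the window-aware path scans the first 2x2 window
# (0,0), (0,1), (1,0), (1,1) before moving on, whereas the sequential
# path would interleave it with tokens from the neighboring window.
print(window_scan_order(4, 4)[:4])
```

Because each path sees a different neighborhood structure, a token pruned from one ordering can still retain short-range connectivity in the other, which is one plausible reading of why the asymmetry helps pruning robustness.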
Empirical results demonstrate that AMVim achieves state-of-the-art pruning robustness. Under token reduction, AMVim-T achieves a substantial 34\% improvement in training-free accuracy at identical model size and FLOPs. Meanwhile, AMVim-S exhibits only a 1.5\% accuracy drop, performing comparably to ViT. Notably, AMVim also delivers superior performance in pruning-free settings, further validating its architectural advantages.
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 14467