Abstract: Mamba, a State Space Model (SSM) that accelerates training by recasting recurrence as a parallel selective scan, has recently emerged as a linearly scaling, efficient alternative to self-attention. Because of its unidirectional nature, each state in Mamba only carries information from its preceding states and is blind to those that follow. Current Mamba-based computer-vision methods typically overcome this limitation by augmenting Mamba's global forward scan with a global backward scan, forming a bi-directional scan that restores a full receptive field. However, this operation doubles the computational load, eroding much of the efficiency advantage that Mamba originally offers. To eliminate the extra scan, we introduce LBMamba, a locally bi-directional SSM block that embeds a lightweight locally backward scan inside the forward selective scan and executes it entirely in per-thread registers. Building on LBMamba, we present LBVim, a scalable vision backbone that alternates scan directions every two layers to recover a global receptive field without extra backward sweeps. We validate the versatility of our approach on both natural images and whole slide images (WSIs), and show that LBVim consistently offers a superior performance–throughput trade-off: at the same throughput, LBVim achieves 0.8% to 1.6% higher top-1 accuracy on the ImageNet-1K classification dataset, 0.6% to 2.7% higher mIoU on the ADE20K semantic segmentation dataset, and 0.9% higher AP$^b$ and 1.1% higher AP$^m$ on the COCO detection dataset. Our method also serves as a general-purpose enhancement, boosting the accuracy of four SOTA Mamba models, namely VMamba, LocalVim, PlainMamba and Adventurer, by 0.5% to 3.4%. We further integrate LBMamba into the SOTA pathology multiple instance learning (MIL) approach, MambaMIL, which uses a single-directional scan. Experiments on three public WSI classification datasets show that our method achieves relative improvements of up to 3.06% in AUC, 3.39% in F1, and 1.67% in accuracy. Our code is available at https://github.com/cvlab-stonybrook/LBMamba.
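The core idea, a global forward selective scan paired with a backward scan confined to local windows, can be illustrated with a minimal, unfused PyTorch sketch. This assumes a simplified scalar-decay SSM h_t = a_t * h_{t-1} + b_t * x_t; the function name, the window_size parameter, and the additive fusion rule are hypothetical illustrations, not the paper's fused per-thread-register CUDA kernel.

```python
# Minimal sketch (assumed simplification, not the paper's fused kernel):
# a global forward scan over the full sequence plus a lightweight backward
# scan restricted to local windows of size M (here, window_size).
import torch

def local_bidirectional_scan(x, a, b, window_size=8):
    """x, a, b: (batch, length) tensors; returns fused fwd/local-bwd states."""
    B, L = x.shape
    h_fwd = torch.zeros(B, L)
    h = torch.zeros(B)
    for t in range(L):                      # global forward selective scan
        h = a[:, t] * h + b[:, t] * x[:, t]
        h_fwd[:, t] = h
    h_bwd = torch.zeros(B, L)
    for s in range(0, L, window_size):      # backward scan inside each window only
        h = torch.zeros(B)
        for t in reversed(range(s, min(s + window_size, L))):
            h = a[:, t] * h + b[:, t] * x[:, t]
            h_bwd[:, t] = h
    # Additive fusion for illustration; a real implementation would avoid
    # double-counting the contribution of x_t at position t.
    return h_fwd + h_bwd
```

Because the backward pass never crosses a window boundary, no second global sweep over the sequence is required; per the abstract, the actual implementation fuses both passes so the local backward scan executes entirely in per-thread registers.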
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: 1. Strengthened Evaluation: We have expanded our experimental results to demonstrate the generalizability of our hardware-aware algorithm. We applied our method to several additional SOTA baselines (VMamba-Nano, PlainMamba-L1, and LocalVim-T), showing consistent accuracy improvements at comparable throughput levels. We also added FPS (throughput) metrics to the DeiT comparisons.
2. Clarified Contribution vs. Adventurer: We have revised the manuscript to explicitly clarify the distinction between our method and Adventurer: our method focuses on local understanding and is orthogonal to Adventurer, which focuses on global understanding. To validate this, we integrated our method with Adventurer and achieved an additional 0.5% performance gain, demonstrating their complementary nature.
3. Clarified Metrics: We refined our discussion on efficiency metrics (Section 4.1), reiterating why throughput is a more representative measure of practical speed than FLOPs for our specific hardware-level optimizations.
4. Additional Revisions: We have added new visualizations (Appendix G), clarified the role of the hyper-parameter M, and corrected minor presentation issues and notations as suggested.
Code: https://github.com/cvlab-stonybrook/LBMamba
Supplementary Material: pdf
Assigned Action Editor: ~Charles_Xu1
Submission Number: 5155