BlockMamba: Efficient Scalable Structured Sparsity for Mamba

Published: 02 Mar 2026, Last Modified: 12 Mar 2026, ICLR 2026 Workshop World Models, CC BY 4.0
Keywords: sparsity, block sparsity, mamba, state-space models, neuromorphic
Abstract: State Space Models, particularly Mamba, have shown remarkable capabilities for efficiently scalable language modeling. To improve their training and inference efficiency, we study the effect of introducing parameter sparsity and quantization in Mamba. Specifically, we introduce BlockMamba, a simple method to efficiently train sparse Mamba language models using block-diagonal sparsity for the MLPs. We also explore the effect of power-of-two quantization on model performance. Our results show that BlockMamba at 60-75% sparsity trains 35-40% faster on GPUs than an equivalently sized dense Mamba, with minimal loss in language modeling performance, and yields a 15-20% inference speedup over the dense model on GPUs. We scale our approach by sparsifying a 1.4B-parameter model, demonstrating competitive language modeling performance and strong length extrapolation capability. We chose the combination of block sparsity and power-of-two quantization because both are specifically advantageous for efficiency on alternative hardware such as neuromorphic or edge platforms. We provide a brief analysis of the efficiency gains on a neuromorphic hardware platform, the SpiNNaker2 system, to highlight BlockMamba's potential. Because of their low-power characteristics, neuromorphic platforms can serve as mobile computing platforms for robotics and automotive applications, which makes them a useful target for such world models.
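The two techniques named in the abstract can be illustrated compactly. The following is a minimal numpy sketch, not the authors' implementation: it builds a block-diagonal mask for an MLP weight matrix (with `num_blocks` blocks, a fraction `(num_blocks - 1) / num_blocks` of the weights is zeroed, e.g. 4 blocks gives 75% sparsity) and rounds the surviving weights to the nearest signed power of two. All function names and shapes here are illustrative assumptions.

```python
import numpy as np

def block_diagonal_mask(dim_out, dim_in, num_blocks):
    """Boolean mask keeping only the block-diagonal entries of a
    (dim_out, dim_in) weight matrix. Sparsity = 1 - 1/num_blocks."""
    mask = np.zeros((dim_out, dim_in), dtype=bool)
    rb, cb = dim_out // num_blocks, dim_in // num_blocks
    for b in range(num_blocks):
        mask[b * rb:(b + 1) * rb, b * cb:(b + 1) * cb] = True
    return mask

def pow2_quantize(w):
    """Round each nonzero weight to the nearest signed power of two,
    so multiplications reduce to bit shifts on suitable hardware."""
    sign = np.sign(w)
    mag = np.abs(w)
    # Placeholder 1.0 avoids log2(0); zeros are restored below.
    exp = np.round(np.log2(np.where(mag > 0, mag, 1.0)))
    return np.where(mag > 0, sign * np.exp2(exp), 0.0)

# Example: sparsify and quantize one MLP weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
mask = block_diagonal_mask(8, 8, 4)
W_sparse = np.where(mask, W, 0.0)
W_q = pow2_quantize(W_sparse)
sparsity = 1.0 - mask.mean()  # 4 blocks -> 0.75
```

Block-diagonal sparsity is hardware-friendly because each dense sub-block maps to a contiguous small matrix multiply, and power-of-two weights replace multiplications with shifts, which is the property the paper exploits on SpiNNaker2.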
Submission Number: 61