BlockMamba: Efficient Scalable Structured Sparsity for Mamba
Keywords: sparsity, block sparsity, mamba, state-space models, neuromorphic
Abstract: State Space Models, particularly Mamba, have shown remarkable capabilities for
efficiently scalable language modeling. To improve their efficiency for training
and inference, we study the effect of introducing parameter sparsity and quanti-
zation in Mamba. Specifically, we introduce BlockMamba, a simple method to
efficiently train sparse Mamba language models using block diagonal sparsity for
the MLPs. Additionally, we also explore the effect of Power-of-Two quantization
on model performance. Our results show that BlockMamba at 60-75% sparsity
trains 35-40% faster than an equivalently sized dense Mamba model, with minimal loss in
language modeling performance on GPUs. We also demonstrate that our method
yields a 15-20% inference speedup over dense models on GPUs. We scale our
approach by sparsifying a 1.4B parameter model, and demonstrate competitive
language modeling performance and strong length extrapolation capability. We
chose the combination of block sparsity and power-of-two quantization because
both are particularly advantageous for efficiency on alternative hardware such
as neuromorphic or edge platforms. To highlight BlockMamba's potential there, we provide a brief analysis of the efficiency gains on a neuromorphic hardware platform, the SpiNNaker2 system. Owing to its low-power characteristics, neuromorphic hardware can serve as a mobile computing platform for robotics and automotive applications, making it a useful target for such
world models.
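As a rough illustration of the two ideas named in the abstract (not the authors' implementation, whose details are not given here), the sketch below shows a block-diagonal matrix-vector product, where only the dense blocks are stored and computed, and a simple round-to-nearest power-of-two weight quantizer; all function names are illustrative.

```python
import math

def pow2_quantize(w: float) -> float:
    """Round a weight to the nearest signed power of two (sign * 2^k).
    Multiplication by such a weight reduces to a bit shift on integer
    hardware, which is one reason it suits neuromorphic/edge targets."""
    if w == 0.0:
        return 0.0
    sign = 1.0 if w > 0 else -1.0
    k = round(math.log2(abs(w)))
    return sign * (2.0 ** k)

def block_diagonal_matvec(blocks, x):
    """Compute y = W x where W is block diagonal with square blocks.
    Only the blocks are touched, so compute and memory scale with the
    block sizes rather than the full (dense) matrix dimensions."""
    y = []
    offset = 0
    for block in blocks:
        n = len(block)                    # this block is n x n
        seg = x[offset:offset + n]
        for row in block:
            y.append(sum(w_ij * x_j for w_ij, x_j in zip(row, seg)))
        offset += n
    return y
```

For example, two blocks of sizes 2 and 1 on a length-3 input, `block_diagonal_matvec([[[1, 0], [0, 1]], [[2]]], [3, 4, 5])`, returns `[3, 4, 10]`, touching 5 stored weights instead of the 9 a dense 3x3 matrix would require.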
Submission Number: 61