Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC 4.0
TL;DR: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
Abstract: State Space Models (SSMs) are gaining attention as an efficient alternative to Transformers due to their constant memory complexity and comparable performance. Yet, deploying large-scale SSMs on cloud services or resource-constrained devices remains challenging. To address this, quantizing SSMs with low bit-width data types has been proposed to reduce model size and leverage hardware acceleration. Because SSMs are sensitive to quantization errors, recent work focuses on optimizing a particular model or bit-width configuration for efficiency without sacrificing performance. However, different scenarios require different bit-width configurations, such as W4A8 for boosting cloud-serving throughput and W4A16 for improving question-answering on personal devices. To this end, we present Quamba2, compatible with \textbf{W8A8}, \textbf{W4A8}, and \textbf{W4A16} for both \textbf{Mamba} and \textbf{Mamba2}, addressing the rising demand for SSM deployment across diverse platforms. We propose an offline approach that quantizes the inputs of the linear recurrence to 8 bits by sorting and clustering the channels of $x$, combined with per-state-group quantization for $B$ and $C$. To keep the SSM output compute-invariant, we rearrange the weights offline according to the clustering order. Experiments show that Quamba2-8B outperforms several state-of-the-art SSM quantization methods, delivering 1.3$\times$ and 3$\times$ speedups in the pre-filling and generation stages, respectively, and a 4$\times$ memory reduction, with only a $1.6$% average accuracy drop. The code and quantized models are released at: https://github.com/enyac-group/Quamba
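To make the sorting-and-clustering idea concrete, below is a minimal NumPy sketch of per-channel 8-bit activation quantization where channels are sorted by calibration statistics and grouped into clusters that share one scale. The function names (`sort_and_cluster_scales`, `quantize_int8`) and the contiguous-split clustering are assumptions for illustration only, not the paper's actual implementation; Quamba2 additionally handles $B$ and $C$ per state group and permutes the downstream weights offline by the same order to keep the output compute-invariant.

```python
import numpy as np

def sort_and_cluster_scales(x_calib, num_clusters=8):
    """Hypothetical sketch: sort channels by their calibration maxima,
    split them into clusters, and give each cluster one shared int8 scale.
    Returns the channel order (so weights can be permuted offline to
    preserve the output) and the per-channel scales."""
    # Per-channel absolute maxima over calibration samples, shape (C,)
    ch_max = np.abs(x_calib).max(axis=0)
    order = np.argsort(ch_max)                      # sorting step
    clusters = np.array_split(order, num_clusters)  # contiguous clusters
    scales = np.empty_like(ch_max)
    for idx in clusters:
        # One shared scale per cluster, mapped to the signed int8 range
        scales[idx] = np.maximum(ch_max[idx].max(), 1e-8) / 127.0
    return order, scales

def quantize_int8(x, scales):
    """Symmetric per-channel int8 quantization with the shared scales."""
    return np.clip(np.round(x / scales), -128, 127).astype(np.int8)

# Usage: calibrate offline, then quantize activations at inference time.
x_calib = np.random.randn(512, 64).astype(np.float32)
order, scales = sort_and_cluster_scales(x_calib)
x_q = quantize_int8(x_calib, scales)
```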
Lay Summary: Large AI models are powerful but often too big and slow to run efficiently on everyday devices or even in the cloud. State Space Models (SSMs) are a newer type of AI model that uses memory more efficiently than the popular Transformer models, making them a promising option. However, running these models quickly across different hardware remains a challenge. Our work introduces Quamba2, a method that makes these models smaller and faster by converting their numbers into simpler, lower-precision formats. This helps the models run better on everything from cloud servers to personal laptops, depending on the task. Quamba2 supports several precision levels, so it can balance speed and accuracy depending on where it's used. We tested Quamba2 on large models and found it could cut memory use by up to 4× and speed up responses significantly, with only a small drop in performance. This brings us closer to making powerful AI models work smoothly across a wide range of platforms. Our code and models will be shared with the community.
Link To Code: https://github.com/enyac-group/Quamba
Primary Area: Deep Learning->Large Language Models
Keywords: State Space Models, Model quantization
Submission Number: 7204