Q-Mamba: Towards more efficient Mamba models via Post-Training Quantization

ICLR 2025 Conference Submission 2045 Authors

Published: 20 Sept 2024 (modified: 26 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Mamba, Quantization
Abstract: State Space Models (SSMs), such as Mamba, have recently demonstrated the potential to match or even surpass Transformers in language understanding tasks, making them a promising alternative for designing Large Language Models (LLMs). Concurrently, model quantization, especially Post-Training Quantization (PTQ), has proven effective in reducing memory usage and inference latency in LLMs. In this paper, we explore post-training quantization for Mamba (\textbf{Q-Mamba}) by quantizing both linear projections and state caches to low-bit integers for efficient inference. After a theoretical analysis of the causes of outliers in states, we propose \textbf{Decoupled Scale Quantization (DSQ)}, which mitigates outliers along both the state and channel dimensions by applying separate quantization scales to each. To preserve the selective ability of quantized Mamba, we introduce \textbf{Efficient Selectivity Reconstruction (ESR)}, a block-wise reconstruction method built on a novel quantization simulation scheme that allows the fast parallel scan algorithm to be used together with the non-linear quantization function. We demonstrate the effectiveness of Q-Mamba across various quantization settings, model sizes, and both generation and zero-shot tasks. In particular, for Mamba2-2.7B with W8A8H4 quantization, Q-Mamba achieves a 50\% reduction in memory consumption with only a 2.13\% average accuracy degradation on zero-shot tasks.
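To make the idea of decoupled scales concrete, the minimal sketch below shows one plausible way to quantize a state cache with separate per-channel and per-state-dimension scale vectors, as the abstract describes. It is not the authors' DSQ implementation: the tensor shapes, the square-root factorization of the two scales, and the function name `decoupled_scale_quant` are illustrative assumptions.

```python
import torch

def decoupled_scale_quant(h: torch.Tensor, n_bits: int = 4):
    """Sketch of decoupled-scale quantization for a state cache.

    h: state cache of shape (channels, state_dim).
    Instead of a single per-tensor scale, the scale is factored into a
    per-channel vector and a per-state-dimension vector, so an outlier in
    one channel or one state dimension does not inflate the scale of the
    whole tensor. The sqrt split of the two maxima is an assumption.
    """
    qmax = 2 ** (n_bits - 1) - 1
    # Per-channel and per-state absolute maxima: shapes (C, 1) and (1, N).
    s_channel = h.abs().amax(dim=1, keepdim=True).clamp(min=1e-8).sqrt()
    s_state = h.abs().amax(dim=0, keepdim=True).clamp(min=1e-8).sqrt()
    # Decoupled scale: outer product of the two 1-D scale vectors.
    scale = s_channel * s_state
    q = torch.clamp(torch.round(h / scale * qmax), -qmax - 1, qmax)
    return q, scale

def dequant(q: torch.Tensor, scale: torch.Tensor, n_bits: int = 4):
    """Recover an approximate state cache from the integer codes."""
    qmax = 2 ** (n_bits - 1) - 1
    return q / qmax * scale

# Example: a 16-channel, 64-dimensional state cache at 4 bits (H4).
h = torch.randn(16, 64)
q, scale = decoupled_scale_quant(h, n_bits=4)
h_hat = dequant(q, scale, n_bits=4)
```

The point of the factorization is only that each element's effective scale depends on both its channel and its state dimension; how the paper actually derives and applies the two scale vectors is detailed in the full text.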
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2045