Q-Mamba: Towards more efficient Mamba models via Post-Training Quantization

ICLR 2025 Conference Submission 2045 Authors

Published: 20 Sept 2024 (modified: 26 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Mamba, Quantization
Abstract: State Space Models (SSMs), such as Mamba, have recently demonstrated the potential to match or even surpass Transformers in language understanding tasks, making them a promising alternative for designing Large Language Models (LLMs). Concurrently, model quantization, especially Post-Training Quantization (PTQ), has proven effective in reducing memory usage and inference latency in LLMs. In this paper, we explore post-training quantization for Mamba (\textbf{Q-Mamba}) by quantizing both linear projections and state caches to low-bit integers for efficient inference. After a theoretical analysis of the causes of outliers in states, we propose \textbf{Decoupled Scale Quantization (DSQ)}, which mitigates outliers along both the state and channel dimensions by applying separate quantization scales to each. To preserve the selective ability of quantized Mamba, we introduce \textbf{Efficient Selectivity Reconstruction (ESR)}, a block-wise reconstruction method built on a novel quantization simulation scheme that allows the fast parallel scan algorithm to be used together with the non-linear quantization function. We demonstrate the effectiveness of Q-Mamba across various quantization settings, model sizes, and both generation and zero-shot tasks. In particular, for Mamba2-2.7B with W8A8H4 quantization, Q-Mamba achieves a 50\% reduction in memory consumption with only a 2.13\% average accuracy degradation on zero-shot tasks.
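To make the idea of decoupled scales concrete, the minimal sketch below shows one plausible way to quantize a state cache with separate per-channel and per-state-dimension scale vectors, as the abstract describes. It is not the authors' DSQ implementation: the tensor shapes, the square-root factorization of the two scales, and the function name `decoupled_scale_quant` are illustrative assumptions.

```python
import torch

def decoupled_scale_quant(h: torch.Tensor, n_bits: int = 4):
    """Sketch of decoupled-scale quantization for a state cache.

    h: state cache of shape (channels, state_dim).
    Instead of a single per-tensor scale, the scale is factored into a
    per-channel vector and a per-state-dimension vector, so an outlier in
    one channel or one state dimension does not inflate the scale of the
    whole tensor. The sqrt split of the two maxima is an assumption.
    """
    qmax = 2 ** (n_bits - 1) - 1
    # Per-channel and per-state absolute maxima: shapes (C, 1) and (1, N).
    s_channel = h.abs().amax(dim=1, keepdim=True).clamp(min=1e-8).sqrt()
    s_state = h.abs().amax(dim=0, keepdim=True).clamp(min=1e-8).sqrt()
    # Decoupled scale: outer product of the two 1-D scale vectors.
    scale = s_channel * s_state
    q = torch.clamp(torch.round(h / scale * qmax), -qmax - 1, qmax)
    return q, scale

def dequant(q: torch.Tensor, scale: torch.Tensor, n_bits: int = 4):
    """Recover an approximate state cache from the integer codes."""
    qmax = 2 ** (n_bits - 1) - 1
    return q / qmax * scale

# Example: a 16-channel, 64-dimensional state cache at 4 bits (H4).
h = torch.randn(16, 64)
q, scale = decoupled_scale_quant(h, n_bits=4)
h_hat = dequant(q, scale, n_bits=4)
```

The point of the factorization is only that each element's effective scale depends on both its channel and its state dimension; how the paper actually derives and applies the two scale vectors is detailed in the full text.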
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2045