Abstract: The selective state-space model (SSM) Mamba addresses several limitations of Transformers, such as quadratic computational complexity with sequence length and the significant inference-time memory required by the key-value cache. However, the growing size of Mamba models continues to pose training and deployment challenges and raises concerns about their considerable training and inference compute consumption. In this work, we introduce Bi-Mamba, a scalable and powerful 1-bit Mamba architecture designed for more efficient large language models, with model sizes of 780M, 1.3B, and 2.7B parameters. Bi-Mamba models are trained from scratch on a data volume comparable to regular LLM pretraining, using an autoregressive distillation loss. Extensive experimental results on language modeling demonstrate that Bi-Mamba achieves performance comparable to its full-precision counterparts (e.g., FP16 or BF16) and much higher accuracy than post-training-binarization (PTB) Mamba and binarization-aware-training (BAT) Transformer baselines, while significantly reducing the memory footprint and computational cost of the original Mamba model. Our study pioneers a linear-complexity LLM framework under low-bit representation and facilitates the future design of specialized hardware tailored for efficient 1-bit Mamba-based LLMs. Our code is provided in the Supplementary Material, and the pre-trained weights are available anonymously at https://drive.google.com/drive/folders/1jfk_TlDzFbER84ITvU2hOX2VyPC9H4MA?usp=sharing
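For context, below is a minimal sketch of the kind of 1-bit weight layer and autoregressive distillation objective mentioned in the abstract, assuming standard sign-based binarization with a per-tensor scale, a straight-through estimator, and a KL-based teacher-student loss over next-token distributions. The exact formulation used by Bi-Mamba may differ; all names and hyperparameters here are illustrative.

```python
# Illustrative sketch only (not the authors' implementation): a binarized linear
# layer and an autoregressive distillation loss against a full-precision teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryLinear(nn.Module):
    """Linear layer whose weights are binarized to {-scale, +scale} in the forward pass."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean()                 # assumed per-tensor scaling factor
        w_bin = torch.sign(w) * scale          # 1-bit weights
        w_bin = w + (w_bin - w).detach()       # straight-through estimator for gradients
        return F.linear(x, w_bin)

def autoregressive_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student next-token distributions at every position."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```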
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=HxU0wSMZ0n
Changes Since Last Submission: We have carefully addressed all reviewer concerns in the revised manuscript. The revisions include:
- *Performance comparisons with stronger baselines* such as SqueezeLLM and AQLM, as shown in Table 3.
- *Additional results on more downstream tasks*, including HumanEval and GSM8K, as shown in Table 4.
- *Analysis of actual speedup, memory, and energy consumption*, presented in Table 8 and Section 5.5.
- A *complexity analysis*, added as a new section of the method.
- *Fairer experimental comparisons with FBI-LLM*, included in Table 3.
- *Training cost analysis of Bi-Mamba objectives*, provided in Table 6.
- *Weight distribution visualization for different training objectives*, depicted in Figure 7.
- Clarification of *data types used during training, inference, and storage of Bi-Mamba*, elaborated in Section 5.4.
Assigned Action Editor: ~Vladimir_Braverman1
Submission Number: 4910