MambaExtend: A Training-Free Approach to Improve Long Context Extension of Mamba

Published: 22 Jan 2025, Last Modified: 08 Mar 2025 · ICLR 2025 Poster · CC BY 4.0
Keywords: Mamba, Long Context Generalization, Discretization Step, SSM
Abstract: The inherent quadratic complexity of the attention mechanism in transformer models has driven the research community to explore alternative architectures with sub-quadratic complexity, such as state-space models. Mamba has established itself as a leading model within this emerging paradigm, achieving state-of-the-art results on various language modeling benchmarks. However, despite its impressive performance, Mamba's effectiveness is limited by its pre-training context length, resulting in pronounced degradation when the model must handle longer contexts. Our investigation reveals that Mamba's inability to generalize effectively to long contexts is primarily due to out-of-distribution (OOD) discretization steps. To address this critical limitation, we introduce _**MambaExtend**_, a novel framework designed to significantly enhance the context-extension capabilities of Mamba. Specifically, MambaExtend leverages a _**training-free**_ approach to calibrate _only_ the scaling factors of the discretization modules at different layers. We demonstrate both gradient-based and gradient-free zeroth-order optimization to learn the optimal scaling factor for each Mamba layer, requiring orders of magnitude fewer updates than parameter fine-tuning-based alternatives. Using this approach, we achieve a training-free context extension of up to $32\times$, expanding the context from 2k to 64k tokens with minimal increase in perplexity. In contrast to existing fine-tuning methods, MambaExtend selectively calibrates the scaling factors, requiring up to $\mathbf{5.42 \times 10^6}\times$ fewer parameter updates and incurring up to $\mathbf{3.87}\times$ lower peak memory usage, while delivering comparable or superior long-context performance across multiple tasks. Code and checkpoints are available here$^1$.
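
The abstract's core mechanism, rescaling each layer's discretization step with a single calibrated scalar found without fine-tuning the model weights, can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the released MambaExtend code: the toy layer structure, the point where the scale multiplies the step $\Delta$, the random-search objective, and all names such as `ScaledDeltaMambaLayer` and `calibrate_scales_zeroth_order` are hypothetical.

```python
# Minimal sketch (not the authors' implementation): per-layer scaling of the
# Mamba discretization step (delta), calibrated with a simple gradient-free
# zeroth-order search on a small calibration batch.
import torch
import torch.nn as nn


class ScaledDeltaMambaLayer(nn.Module):
    """Toy stand-in for a Mamba block; only the delta-scaling idea is shown."""

    def __init__(self, d_model: int):
        super().__init__()
        self.delta_proj = nn.Linear(d_model, d_model)
        # One scalar per layer: MambaExtend calibrates such scaling factors
        # instead of updating the pretrained weights.
        self.register_buffer("delta_scale", torch.tensor(1.0))
        self.mix = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The discretization step is rescaled before it enters the recurrence;
        # a simple gated mixing stands in for the actual SSM scan here.
        delta = torch.nn.functional.softplus(self.delta_proj(x)) * self.delta_scale
        return x + torch.tanh(delta) * self.mix(x)


def perplexity_proxy(model, batch):
    # Placeholder objective: feature-level error on a long calibration
    # sequence; a real setup would measure language-model perplexity.
    x, y = batch
    return torch.nn.functional.mse_loss(model(x), y).item()


@torch.no_grad()
def calibrate_scales_zeroth_order(model, calib_batch, n_trials: int = 64):
    """Random-search calibration of per-layer delta scales (one gradient-free
    option; the paper also reports gradient-based calibration)."""
    layers = [m for m in model.modules() if isinstance(m, ScaledDeltaMambaLayer)]
    best_loss = float("inf")
    best_scales = [layer.delta_scale.clone() for layer in layers]
    for _ in range(n_trials):
        # Sample candidate scales around 1.0 in log-space, one per layer.
        for layer in layers:
            layer.delta_scale.copy_(torch.exp(torch.randn(()) * 0.5))
        loss = perplexity_proxy(model, calib_batch)
        if loss < best_loss:
            best_loss = loss
            best_scales = [layer.delta_scale.clone() for layer in layers]
    # Restore the best scales found; only these scalars were ever touched.
    for layer, scale in zip(layers, best_scales):
        layer.delta_scale.copy_(scale)
    return best_loss


if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(*[ScaledDeltaMambaLayer(32) for _ in range(4)])
    x = torch.randn(2, 64, 32)  # (batch, long calibration context, d_model)
    y = torch.randn(2, 64, 32)
    print("calibrated loss:", calibrate_scales_zeroth_order(model, (x, y)))
```

The point of the sketch is the parameter count: with four layers, the search touches only four scalars, which is what makes the update and memory budgets in the abstract plausible relative to full fine-tuning.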
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9212
