Stuffed Mamba: State Collapse and State Capacity of RNN-Based Long-Context Modeling

ICLR 2025 Conference Submission 12121 Authors

27 Sept 2024 (modified: 16 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: RNN, foundation models, long-context
TL;DR: Systematic study of linear diagonal RNNs' length generalization failure and memory capacity in language modeling and contextual information retrieval.
Abstract: One essential advantage of recurrent neural networks (RNNs) over transformer-based language models is their linear computational complexity with respect to sequence length, which makes them much faster at handling long sequences during inference. However, most publicly available RNNs (e.g., Mamba and RWKV) are trained on sequences of fewer than 10K tokens, and their effectiveness on longer contexts remains largely unsatisfactory. In this paper, we study the causes of RNNs' inability to process long contexts and suggest critical mitigations. First, we investigate *state collapse* (SC) in Mamba-2 when processing long sequences, a phenomenon in which some channels of the recurrent state exhibit exploding values that cause severe performance degradation. With controlled experiments, we discover that the model fails to forget earlier tokens when there is more information than it can remember. We attribute this to overfitting caused by the recurrent state being overparameterized for the training length, thereby establishing a relationship between SC and the capacity of the state. To support this hypothesis, we make an important empirical observation: for any given state size, there exists a training length threshold such that SC is exhibited if and only if the training length is smaller than this threshold. Empirically searching for this threshold across different state sizes reveals that it is a linear function of the state size. We also search for the maximum context length at which the model can recall contextual information and find that this length scales exponentially with the state size. Based on these findings, we train a Mamba-2 370M model that achieves near-perfect passkey retrieval accuracy at a context length of 256K. This suggests a promising future for RNN-based long-context modeling. Code and model checkpoints will be publicly released.
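The state-collapse behavior described above (a few state channels blowing up when the model cannot forget earlier tokens) can be made concrete with a toy monitoring loop. The sketch below is purely illustrative and is not the authors' code: it assumes a simplified diagonal recurrence h_t = a ⊙ h_{t-1} + x_t with per-channel decays near 1, and the dimensions, decay values, and flagging rule are placeholder assumptions rather than details from the paper.

```python
import torch

# Hypothetical sketch: monitor per-channel magnitudes of a diagonal linear
# RNN state over a long input. Channels whose peak magnitude grows far
# beyond the rest behave like the "exploding" channels associated with
# state collapse when the decays stay too close to 1 (failure to forget).

torch.manual_seed(0)
d_state, seq_len = 128, 8192            # assumed toy sizes

h = torch.zeros(d_state)
decay = 0.98 + 0.02 * torch.rand(d_state)   # assumed decays close to 1
peak = torch.zeros(d_state)                 # running per-channel peak of |h_t|

for _ in range(seq_len):
    x = torch.randn(d_state)            # stand-in for the projected input token
    h = decay * h + x                   # simplified diagonal state update
    peak = torch.maximum(peak, h.abs())

# Flag channels whose peak is an order of magnitude above the median peak.
suspect = (peak > 10 * peak.median()).nonzero().flatten().tolist()
print(f"channels with outlier peak magnitude: {suspect}")
```

In practice one would read the recurrent state out of a trained Mamba-2-style model at each step instead of simulating a toy update, but the same peak-magnitude bookkeeping applies.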
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12121