Understanding and Improving Length Generalization in Recurrent Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Recently, recurrent models such as state space models and linear attention have become popular due to their linear complexity in the sequence length. Thanks to their recurrent nature, in principle they can process arbitrarily long sequences, but their performance sometimes drops considerably beyond their training context lengths---i.e. they fail to length generalize. In this work, we provide comprehensive empirical and theoretical analysis to support the \textit{unexplored states hypothesis}, which posits that models fail to length generalize when during training they are only exposed to a limited subset of the distribution of all \textit{attainable} states (i.e. states that would be attained if the recurrence was applied to long sequences). Furthermore, we investigate simple training interventions that aim to increase the coverage of the states that the model is trained on, e.g. by initializing the state with Gaussian noise or with the final state of a different input sequence. With only 500 post-training steps ($\sim 0.1\%$ of the pre-training budget), these interventions enable length generalization for sequences that are orders of magnitude longer than the training context (e.g. $2k\longrightarrow 128k$) and show improved performance in long context tasks, thus presenting a simple and efficient way to enable robust length generalization in general recurrent models.
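To make the described interventions concrete, below is a minimal sketch of the two state-initialization strategies mentioned in the abstract (Gaussian-noise initialization and initializing with the final state of a different sequence), applied to a toy diagonal linear recurrence. This is not the authors' implementation; the class and function names (`ToyLinearRecurrence`, `noisy_init`, `state_passing_init`) and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of the state-initialization interventions,
# shown on a toy diagonal linear recurrence: h_t = A * h_{t-1} + B x_t, y_t = C h_t.
import torch


class ToyLinearRecurrence(torch.nn.Module):
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.A = torch.nn.Parameter(torch.rand(d_state) * 0.9)          # stable diagonal recurrence
        self.B = torch.nn.Parameter(torch.randn(d_state, d_model) * 0.02)
        self.C = torch.nn.Parameter(torch.randn(d_model, d_state) * 0.02)

    def forward(self, x, h0=None):
        # x: (batch, seq_len, d_model); h0: optional (batch, d_state) initial state
        batch, seq_len, _ = x.shape
        h = torch.zeros(batch, self.A.shape[0], device=x.device) if h0 is None else h0
        ys = []
        for t in range(seq_len):
            h = self.A * h + x[:, t] @ self.B.T
            ys.append(h @ self.C.T)
        return torch.stack(ys, dim=1), h  # per-step outputs and final state


def noisy_init(batch, d_state, sigma=1.0):
    # Intervention 1: start the recurrence from Gaussian noise instead of zeros.
    return sigma * torch.randn(batch, d_state)


def state_passing_init(model, donor_batch):
    # Intervention 2: start from the final state reached on a *different* input
    # sequence, detached so no gradients flow through the donor sequence.
    with torch.no_grad():
        _, h_final = model(donor_batch)
    return h_final.detach()


# Usage sketch inside a short post-training loop:
model = ToyLinearRecurrence(d_model=16, d_state=32)
x = torch.randn(4, 128, 16)      # current training batch
donor = torch.randn(4, 128, 16)  # unrelated batch used only to produce a state
y_noise, _ = model(x, h0=noisy_init(4, 32))
y_pass, _ = model(x, h0=state_passing_init(model, donor))
```

The intent of both variants is the same: expose the model during (post-)training to initial states it would otherwise only encounter deep into very long sequences, so that the distribution of states seen in training better covers the attainable states.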
Lay Summary: Recently, recurrent models have emerged as alternative deep learning architectures with strong performance in many areas such as text, vision, and audio. Their main advantage over the widely used Transformers is their ability to process long sequences more efficiently. In practice, however, their performance on long sequences is as poor as that of Transformers, so this advantage remains unrealized. In this work, we study why recurrent models, despite their theoretical ability to process arbitrarily long sequences, fail to achieve good performance on them. Additionally, we propose simple and inexpensive interventions that enable good performance on long sequences, allowing these models to process sequences more than 64 times longer than before and thus realize their advantage over Transformers. Finally, we show that these interventions also improve performance on complex tasks that require long-context reasoning, such as answering a question whose answer is hidden in a very long context.
Primary Area: Deep Learning->Sequential Models, Time series
Keywords: length generalization, length extrapolation, mamba, long context, language models
Submission Number: 9296