Keywords: LSTM, LLM, Language Modeling, NLP
TL;DR: We extend the LSTM architecture with exponential gating and new memory structures and show that this new xLSTM performs favorably on large-scale language modeling tasks.
Abstract: In the 1990s, the constant error carousel and gating were introduced
as the central ideas of the Long Short-Term Memory (LSTM).
Since then, LSTMs have stood the test of time
and contributed to numerous deep learning success stories;
in particular, they constituted the first Large Language Models (LLMs).
However, the advent of the Transformer technology with
parallelizable self-attention at its core
marked the dawn of a new era, outpacing LSTMs at scale.
We now raise a simple question:
How far do we get in language modeling
when scaling LSTMs to billions of parameters,
leveraging the latest techniques from modern LLMs,
but mitigating known limitations of LSTMs?
Firstly, we introduce exponential gating
with appropriate normalization and stabilization techniques.
Secondly, we modify the LSTM memory structure, obtaining:
(i) sLSTM with a scalar memory, a scalar update, and new memory mixing,
(ii) mLSTM that is fully parallelizable
with a matrix memory and a covariance update rule.
Integrating these LSTM extensions into residual block backbones
yields xLSTM blocks that are then residually stacked into xLSTM architectures.
Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably
when compared to state-of-the-art Transformers and
State Space Models, both in performance and scaling.
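
The abstract names exponential gating with normalization and stabilization but does not state the equations. The following is a minimal scalar sketch of one way such a scheme could look, under assumptions: the gates are exponentials of their pre-activations, a normalizer state accumulates the gate weights, and a running maximum in log space keeps every exponent non-positive. All function and variable names (`stabilized_step`, `naive_step`, `readout`, `i_pre`, `f_pre`) are hypothetical and chosen for illustration; this is not presented as the authors' implementation.

```python
import math


def naive_step(c, n, i_pre, f_pre, z):
    """Unstabilized update: exponential gates applied directly (can overflow)."""
    i_gate = math.exp(i_pre)
    f_gate = math.exp(f_pre)
    return f_gate * c + i_gate * z, f_gate * n + i_gate


def stabilized_step(c, n, m, i_pre, f_pre, z):
    """One scalar memory update with exponential input and forget gates.

    c: cell state, n: normalizer state, m: running log-scale stabilizer.
    i_pre, f_pre: gate pre-activations (logs of the raw gates); z: cell input.
    Assumed form for illustration only.
    """
    m_new = max(f_pre + m, i_pre)         # largest log-magnitude so far
    i_gate = math.exp(i_pre - m_new)      # rescaled input gate, at most 1
    f_gate = math.exp(f_pre + m - m_new)  # rescaled forget gate, at most 1
    c_new = f_gate * c + i_gate * z
    n_new = f_gate * n + i_gate
    return c_new, n_new, m_new


def readout(steps, stabilized=True):
    """Run a sequence of (i_pre, f_pre, z) steps and return the normalized state c / n."""
    c, n, m = 0.0, 0.0, 0.0
    for i_pre, f_pre, z in steps:
        if stabilized:
            c, n, m = stabilized_step(c, n, m, i_pre, f_pre, z)
        else:
            c, n = naive_step(c, n, i_pre, f_pre, z)
    return c / n
```

In this sketch the stabilizer rescales `c` and `n` by the same factor `exp(-m)`, so the normalized readout `c / n` is unchanged while no exponent exceeds zero. With moderate pre-activations both variants agree; with a pre-activation such as `1000.0`, `math.exp` in the naive variant raises `OverflowError`, whereas the stabilized variant still returns a finite value.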
Submission Number: 9