xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC-SA 4.0
TL;DR: We introduce xLSTM 7B, a Large Language Model based on the xLSTM architecture with targeted optimizations for fast and efficient inference.
Abstract:

Recent breakthroughs in solving reasoning, math, and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger model sizes and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM’s architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency than Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM’s potential as a foundational architecture for methods that rely heavily on LLM inference. Our model weights, model code, and training code are open-source. Model: https://huggingface.co/NX-AI/xLSTM-7b Code: https://github.com/NX-AI/xlstm and https://github.com/NX-AI/xlstm-jax.
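Since the model weights are released on Hugging Face, a quick way to try the checkpoint is through the `transformers` library. The sketch below is illustrative only: whether NX-AI/xLSTM-7b loads directly via `AutoModelForCausalLM`, and which dtype/device settings are recommended, are assumptions here, so consult the model card for the supported loading path.

```python
# Minimal sketch (not from the paper): loading the released checkpoint with the
# Hugging Face `transformers` API. That NX-AI/xLSTM-7b loads via
# AutoModelForCausalLM is an assumption; see the model card for details.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NX-AI/xLSTM-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the `accelerate` package to be installed.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Recurrent LLMs are efficient at inference because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generation cost grows linearly with the number of generated tokens, since the
# recurrent state has a fixed size (no growing KV cache as in Transformers).
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```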

Lay Summary:

Recent Large Language Models (LLMs) are trained to think through problems before they respond. This thinking process can improve performance, especially on hard reasoning problems. Since LLMs generate long texts during this thought process, the required context length becomes very large. The dominant Transformer architecture scales unfavorably: its compute cost grows quadratically with context length. We show that the recently introduced recurrent xLSTM architecture can be adapted to build an LLM at the seven-billion (7B) parameter scale. Because its compute scales linearly with context length, and thanks to our speed optimizations, it is now the fastest model at this scale while matching the performance of comparable models. This shift from quadratic Transformer models to linear xLSTM models makes LLMs more memory-, compute-, and energy-efficient without compromising the quality of the outputs.
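The quadratic-versus-linear argument above can be made concrete with a toy cost model (illustrative only, not the paper's measurements): if producing token t with attention requires touching all t previously generated tokens, the total generation cost grows quadratically with the number of tokens, whereas a fixed-size recurrent state costs a constant amount per token.

```python
# Toy cost model (illustrative, not the paper's benchmarks): per-step generation
# cost for a Transformer grows with the current context length, while a
# recurrent model such as xLSTM pays a constant cost per generated token.
def transformer_total_cost(num_tokens: int) -> int:
    # Attention at step t attends over t tokens -> sum_{t=1}^{T} t, i.e. O(T^2).
    return sum(t for t in range(1, num_tokens + 1))

def recurrent_total_cost(num_tokens: int, step_cost: int = 1) -> int:
    # A fixed-size recurrent state gives constant work per token, i.e. O(T).
    return num_tokens * step_cost

for T in (1_000, 10_000, 100_000):
    print(f"T={T}: attention~{transformer_total_cost(T)}, recurrent~{recurrent_total_cost(T)}")
```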

Primary Area: Deep Learning->Large Language Models
Keywords: xLSTM, LLM, inference, inference time, inference speed, Transformer
Submission Number: 6752