Block-Biased Mamba for Long-Range Sequence Processing

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Sequence Model, State Space Model, Mamba, Long Range Arena
TL;DR: We analyze Mamba's shortcomings in long-range sequence processing through its expressiveness, inductive bias, and training stability, and design a remedy called Block-Biased S6 to achieve state-of-the-art performance on long-range tasks.
Abstract: Mamba extends earlier state space models (SSMs) by introducing input-dependent dynamics, and has demonstrated strong empirical performance across a range of domains, including language modeling, computer vision, and foundation models. However, a surprising weakness remains: despite being built on architectures designed for long-range dependencies, Mamba performs poorly on long-range sequential tasks. Understanding and addressing this gap is important for improving Mamba's universality and versatility. In this work, we analyze Mamba's limitations through three perspectives: expressiveness, inductive bias, and training stability. Our theoretical results show how Mamba falls short in each of these aspects compared to earlier SSMs such as S4D. To address these issues, we propose $\text{B}_2\text{S}_6$, a simple extension of Mamba's S6 unit that combines block-wise selective dynamics with a channel-specific bias. We prove that these changes equip the model with a better-suited inductive bias and improve its expressiveness and stability. Empirically, $\text{B}_2\text{S}_6$ outperforms S4 and S4D on Long-Range Arena (LRA) tasks while maintaining Mamba's performance on language modeling benchmarks.
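The sketch below is a minimal, hedged reading of the abstract's description of $\text{B}_2\text{S}_6$: a selective SSM whose input-dependent $B_t$ and $C_t$ are shared within blocks of channels, plus an input-independent, channel-specific bias added to $B_t$. The class name, block layout, bias placement, and the naive per-step recurrence are all assumptions for illustration; they are not the authors' implementation, which would use a parallel scan.

```python
# Hedged sketch of a block-biased selective SSM layer, inferred only from the
# abstract: block-wise selective dynamics + channel-specific bias on B.
# Names, shapes, and bias placement are assumptions, not the paper's code.
import torch
import torch.nn as nn


class BlockBiasedS6(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16, n_blocks: int = 4):
        super().__init__()
        assert d_model % n_blocks == 0
        self.d_model, self.d_state, self.n_blocks = d_model, d_state, n_blocks
        # Diagonal state matrix A (kept negative for stability), one row per channel.
        self.log_neg_A = nn.Parameter(torch.randn(d_model, d_state))
        # Selective projections shared within each block of channels
        # (assumption: "block-wise selective dynamics" = one B_t, C_t per block).
        self.to_B = nn.Linear(d_model, n_blocks * d_state)
        self.to_C = nn.Linear(d_model, n_blocks * d_state)
        self.to_dt = nn.Linear(d_model, d_model)
        # Channel-specific, input-independent bias added to the selective B
        # (assumption: this is the "channel-specific bias" of B2S6).
        self.B_bias = nn.Parameter(torch.zeros(d_model, d_state))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model)
        batch, length, _ = x.shape
        A = -torch.exp(self.log_neg_A)                     # (d_model, d_state)
        dt = torch.nn.functional.softplus(self.to_dt(x))   # (batch, length, d_model)
        # Block-wise selective B_t, C_t, broadcast to every channel in the block.
        per_block = self.d_model // self.n_blocks
        B_sel = self.to_B(x).view(batch, length, self.n_blocks, 1, self.d_state)
        C_sel = self.to_C(x).view(batch, length, self.n_blocks, 1, self.d_state)
        B_sel = B_sel.expand(-1, -1, -1, per_block, -1).reshape(
            batch, length, self.d_model, self.d_state)
        C_sel = C_sel.expand(-1, -1, -1, per_block, -1).reshape(
            batch, length, self.d_model, self.d_state)

        h = x.new_zeros(batch, self.d_model, self.d_state)
        ys = []
        for t in range(length):  # naive sequential recurrence for clarity
            dt_t = dt[:, t].unsqueeze(-1)                  # (batch, d_model, 1)
            decay = torch.exp(dt_t * A)                    # ZOH-style discretization
            B_t = B_sel[:, t] + self.B_bias                # selective term + channel bias
            h = decay * h + dt_t * B_t * x[:, t].unsqueeze(-1)
            ys.append((C_sel[:, t] * h).sum(-1))           # (batch, d_model)
        return torch.stack(ys, dim=1)                      # (batch, length, d_model)


if __name__ == "__main__":
    layer = BlockBiasedS6(d_model=8, d_state=4, n_blocks=2)
    print(layer(torch.randn(2, 16, 8)).shape)  # torch.Size([2, 16, 8])
```

Under these assumptions, the bias term $B_t + b$ keeps an input-independent pathway into the state (S4D-like), while the block-shared selective projections retain Mamba-style input dependence at lower per-channel granularity.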
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 3606