From Markov to Laplace: How Mamba In-Context Learns Markov Chains

ICLR 2026 Conference Submission 17277 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: State-space models, Markov chains, In-context learning, Laplacian smoothing
TL;DR: We uncover an interesting phenomenon: when trained on Markov chains, even a single-layer Mamba represents the Bayes-optimal Laplacian smoothing estimator, and we demonstrate this both theoretically and empirically.
Abstract: While transformer-based language models have driven the AI revolution thus far, their computational complexity has spurred growing interest in viable alternatives, such as structured state space sequence models (SSMs) and Selective SSMs. Among these, Mamba (S6) and its variant Mamba-2 have shown remarkable inference speed-ups over transformers while achieving comparable or superior performance on complex language modeling tasks. However, despite these architectural innovations and empirical successes, the fundamental learning capabilities of Mamba remain poorly understood. In this paper, we address this gap by studying in-context learning (ICL) on Markov chains and uncovering an interesting phenomenon: even a single-layer Mamba efficiently learns the in-context Laplacian smoothing estimator, which is both Bayes and minimax optimal. To explain this, we theoretically characterize the representation capacity of Mamba and reveal the fundamental role of convolution in enabling it to represent the optimal Laplacian smoothing. These theoretical insights align strongly with empirical results and, to the best of our knowledge, represent the first formal connection between Mamba and optimal statistical estimators. Finally, we outline promising research directions inspired by these findings.
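For reference, the following is a minimal sketch of the Laplacian (add-beta) smoothing estimator discussed in the abstract, applied to the transition counts observed in the in-context sequence of a first-order Markov chain. The binary alphabet, the choice of smoothing parameter beta, and the function name are illustrative assumptions and are not taken from the paper.

# Minimal sketch (illustrative): add-beta Laplacian smoothing of in-context
# transition counts for a first-order Markov chain. The binary alphabet and
# beta = 0.5 below are assumptions for the example, not the paper's setup.

from collections import Counter

def laplacian_smoothing_next_prob(seq, vocab_size=2, beta=0.5):
    """Return smoothed estimates of P(next = j | last symbol of seq)
    for j = 0, ..., vocab_size - 1, using add-beta smoothing of the
    transition counts observed in `seq`."""
    last = seq[-1]
    # Count transitions that leave the current state `last`.
    counts = Counter(b for a, b in zip(seq, seq[1:]) if a == last)
    total = sum(counts.values())
    # Add-beta smoothing: (n_ij + beta) / (n_i + vocab_size * beta).
    return [(counts[j] + beta) / (total + vocab_size * beta)
            for j in range(vocab_size)]

# Example: a binary context ending in state 1.
probs = laplacian_smoothing_next_prob([0, 1, 1, 0, 1, 1, 1])
print(probs)  # smoothed estimates of P(0 | last = 1) and P(1 | last = 1)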
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 17277