In-Context Stochastic Gradient Descent with Hybrid Mamba-2 and Linear Self-Attention Model

ICLR 2026 Conference Submission 19444 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mamba-2, Stochastic Gradient Descent, In-context Learning
Abstract: State space models (SSMs) have gained popularity as an alternative to Transformers by mitigating the quadratic computational cost associated with self-attention. However, despite their widespread adoption, the theoretical principles underlying their ability to perform in-context learning (ICL) remain poorly understood. In this work, we theoretically analyze the widely used Mamba-2 model (Dao et al. 2024) and demonstrate that a single-layer Mamba-2 can simulate one step of gradient descent, while a hybrid architecture combining Mamba with a Transformer (Mamba $\circ$ TF) can perform mini-batch stochastic gradient descent. Our experimental results support these theoretical findings.
Primary Area: interpretability and explainable AI
Submission Number: 19444