Token Dynamics on Spheres in Mamba Models

Trinh Tien Nguyen; Minh-Khoi Nguyen-Nhat; Duy-Tung Pham; Hoang-Son Do; Tan Minh Nguyen; Thieu Vo

12 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0

Keywords: selective state space model, token dynamics, continuous-time limit, dynamical system

TL;DR: We investigate the dynamical properties of tokens in deep Mamba models, extend our analysis to autoregressive sequence models, and propose principled strategies for enhancing model performance.

Abstract: The dynamical properties of tokens in the internal representations of a deep learning model, known as token dynamics, have recently attracted considerable attention in deep learning theory. While transformer dynamics have been extensively studied, token dynamics in selective state space (Mamba) models remain largely unexplored. Existing studies on Mamba models impose restrictive assumptions, such as relying solely on state space layers and limiting token embeddings to one dimension. However, practical implementations of Mamba incorporate layer normalization and operate in high dimensions, implying that token dynamics evolve on a high-dimensional unit sphere. In this work, we address this gap by formulating deep Mamba models as flow maps on high-dimensional unit spheres and providing a comprehensive theoretical analysis of their token dynamics. We characterize all possible token limit points and establish explicit exponential convergence rates toward these points. Our analysis reveals that the first token’s limit point exerts an attracting effect on other tokens, leading to clustering phenomena. Furthermore, we extend our theoretical analysis of the attracting effect to a broader class of autoregressive sequence models, including state space models, causal Transformers, and classical time-series models. Leveraging this insight, we propose two applications: (i) randomly reordering tokens during training to diversify clustering on a small subset of limit points, thereby improving model performance, and (ii) explaining the attention sink effect in Mamba models through the attraction of the first token. Experimental results confirm our theoretical findings and demonstrate the practical benefits of these refinements, offering new perspectives on enhancing the effectiveness of Mamba models.
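The two ingredients named in the abstract, tokens constrained to a unit sphere and random token reordering, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the normalization or the reordering scheme, so the sketch assumes plain rescaling to unit Euclidean norm as a simplified stand-in for layer normalization, and a uniform random permutation of the sequence. The function names `project_to_sphere` and `shuffle_tokens` and the toy vectors are hypothetical.

```python
import math
import random


def project_to_sphere(token):
    """Rescale a token vector to unit Euclidean norm.

    Assumption: a simplified stand-in for layer normalization,
    not the normalization used in the paper.
    """
    norm = math.sqrt(sum(x * x for x in token))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto the sphere")
    return [x / norm for x in token]


def shuffle_tokens(tokens, rng):
    """Return a randomly reordered copy of the sequence and the permutation used.

    Assumption: a uniform random permutation; the paper's actual
    reordering scheme is not described in the abstract.
    """
    order = list(range(len(tokens)))
    rng.shuffle(order)
    return [tokens[i] for i in order], order


# Toy sequence of four 3-dimensional token embeddings (hypothetical values).
rng = random.Random(0)
tokens = [[3.0, 4.0, 0.0], [1.0, 1.0, 1.0], [0.0, -2.0, 0.0], [5.0, 0.0, 12.0]]

sphere_tokens = [project_to_sphere(t) for t in tokens]
reordered, order = shuffle_tokens(sphere_tokens, rng)
```

Because the abstract attributes the attracting effect to the first token, changing which token occupies the first position is the part of the reordering that matters in this sketch. The permutation is returned so the original order can be recovered if needed.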

Supplementary Material: zip

Primary Area: foundation or frontier models, including LLMs

Submission Number: 4354
