Layer-Wise Analysis of Normalization Strategies in Mamba

ICLR 2026 Conference Submission 609 Authors

01 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mamba, Normalization, Stability, Optimization, Scale invariance, Condition number
TL;DR: A layer-wise analysis of normalization placement in Mamba: pre-SSM normalization improves gradient conditioning while post-SSM normalization suppresses activation growth, motivating a composite BN→SSM→LN scheme that improves convergence and accuracy.
Abstract: The Mamba architecture achieves linear time and memory complexity in long-sequence modeling and vision tasks through a dynamic, input-conditioned state transition mechanism and hardware-efficient scan operations. However, as network depth increases, the state space model (SSM) component tends to amplify activation magnitudes during the forward pass, often leading to gradient explosion. This highlights the need for a systematic normalization design that balances training stability and convergence speed. To address this, we analyze training stability by tracking (i) the spectral norm of the output projection weights and (ii) the largest eigenvalue of the joint input-output covariance matrix, demonstrating the effectiveness of Norm2 (post-SSM) in suppressing activation and gradient scale inflation. From an optimization efficiency perspective, we use K-FAC to approximate the Fisher Information Matrix and show that Norm1 (pre-SSM) significantly reduces the condition number of per-layer gradients, thereby accelerating convergence. Furthermore, we propose a composite normalization strategy (BN→SSM→LN) that combines BatchNorm at the input and LayerNorm at the output of the SSM. We evaluate this strategy across a broad range of benchmarks; experimental results demonstrate that the composite scheme consistently outperforms single or no normalization in both convergence speed and final accuracy. We hope this work provides both theoretical insights and empirical guidance for normalization design in SSM-based models.
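As a rough illustration only (not the submission's implementation), the PyTorch sketch below shows where the two normalizations described in the abstract would sit relative to a stand-in SSM module, along with two of the stability diagnostics the abstract names: the spectral norm of the output projection and the largest eigenvalue of the joint input-output covariance. All class and function names (ToySSM, CompositeNormBlock, output_proj_spectral_norm, joint_cov_top_eig) are hypothetical, and the SSM itself is a crude placeholder rather than a real selective scan.

```python
# Minimal sketch of the composite BN -> SSM -> LN placement from the abstract.
# Not the authors' code; module and function names are hypothetical.
import torch
import torch.nn as nn


class ToySSM(nn.Module):
    """Placeholder for the selective state-space (SSM) component of a Mamba block."""

    def __init__(self, dim: int, state_dim: int = 16):
        super().__init__()
        self.in_proj = nn.Linear(dim, state_dim)
        # The "output projection" whose spectral norm is tracked as a stability diagnostic.
        self.out_proj = nn.Linear(state_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); a real SSM would run an input-conditioned scan here.
        h = torch.cumsum(torch.tanh(self.in_proj(x)), dim=1)  # crude stand-in for the scan
        return self.out_proj(h)


class CompositeNormBlock(nn.Module):
    """BatchNorm on the SSM input (Norm1, pre-SSM) and LayerNorm on its output (Norm2, post-SSM)."""

    def __init__(self, dim: int):
        super().__init__()
        self.pre_norm = nn.BatchNorm1d(dim)   # Norm1: conditions per-layer gradients
        self.ssm = ToySSM(dim)
        self.post_norm = nn.LayerNorm(dim)    # Norm2: suppresses activation-scale inflation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # BatchNorm1d expects (batch, channels, seq_len), so transpose around it.
        x = self.pre_norm(x.transpose(1, 2)).transpose(1, 2)
        x = self.ssm(x)
        return self.post_norm(x)


def output_proj_spectral_norm(block: CompositeNormBlock) -> float:
    """Largest singular value of the SSM output projection weights."""
    return torch.linalg.matrix_norm(block.ssm.out_proj.weight, ord=2).item()


def joint_cov_top_eig(x: torch.Tensor, y: torch.Tensor) -> float:
    """Largest eigenvalue of the covariance of concatenated (input, output) features."""
    z = torch.cat([x, y], dim=-1).reshape(-1, x.shape[-1] + y.shape[-1])
    cov = torch.cov(z.T)  # features as rows, samples as columns
    return torch.linalg.eigvalsh(cov)[-1].item()


if __name__ == "__main__":
    block = CompositeNormBlock(dim=64)
    x = torch.randn(8, 128, 64)  # (batch, seq_len, dim)
    y = block(x)
    print(y.shape, output_proj_spectral_norm(block), joint_cov_top_eig(x, y))
```

The sketch only fixes the placement question (BatchNorm before the SSM, LayerNorm after it); the K-FAC conditioning analysis and the benchmark evaluation described in the abstract are not reproduced here.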
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 609