Layer-Wise Analysis of Normalization Strategies in Mamba

ICLR 2026 Conference Submission 609 Authors

01 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mamba, Normalization, Stability, Optimization, Scale invariance, Condition number
TL;DR: A layer-wise analysis of normalization placement in Mamba: pre-SSM normalization improves gradient conditioning while post-SSM normalization suppresses activation growth, motivating a composite BN→SSM→LN scheme that improves convergence and accuracy.
Abstract: The Mamba architecture achieves linear time and memory complexity in long-sequence modeling and vision tasks through a dynamic, input-conditioned state transition mechanism and hardware-efficient scan operations. However, as network depth increases, the state space model (SSM) component tends to amplify activation magnitudes during the forward pass, often leading to gradient explosion. This highlights the need for a systematic normalization design that balances training stability and convergence speed. To address this, we analyze training stability by tracking (i) the spectral norm of the output projection weights and (ii) the largest eigenvalue of the joint input-output covariance matrix, demonstrating the effectiveness of Norm2 (post-SSM) in suppressing activation and gradient scale inflation. From an optimization efficiency perspective, we use K-FAC to approximate the Fisher Information Matrix and show that Norm1 (pre-SSM) significantly reduces the condition number of per-layer gradients, thereby accelerating convergence. Furthermore, we propose a composite normalization strategy (BN→SSM→LN) that combines BatchNorm at the input and LayerNorm at the output of the SSM. We evaluate this strategy across a broad range of benchmarks; experimental results demonstrate that the composite scheme consistently outperforms single or no normalization in both convergence speed and final accuracy. We hope this work provides both theoretical insights and empirical guidance for normalization design in SSM-based models.
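As a rough illustration only (not the submission's implementation), the PyTorch sketch below shows where the two normalizations described in the abstract would sit relative to a stand-in SSM module, along with two of the stability diagnostics the abstract names: the spectral norm of the output projection and the largest eigenvalue of the joint input-output covariance. All class and function names (ToySSM, CompositeNormBlock, output_proj_spectral_norm, joint_cov_top_eig) are hypothetical, and the SSM itself is a crude placeholder rather than a real selective scan.

```python
# Minimal sketch of the composite BN -> SSM -> LN placement from the abstract.
# Not the authors' code; module and function names are hypothetical.
import torch
import torch.nn as nn


class ToySSM(nn.Module):
    """Placeholder for the selective state-space (SSM) component of a Mamba block."""

    def __init__(self, dim: int, state_dim: int = 16):
        super().__init__()
        self.in_proj = nn.Linear(dim, state_dim)
        # The "output projection" whose spectral norm is tracked as a stability diagnostic.
        self.out_proj = nn.Linear(state_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); a real SSM would run an input-conditioned scan here.
        h = torch.cumsum(torch.tanh(self.in_proj(x)), dim=1)  # crude stand-in for the scan
        return self.out_proj(h)


class CompositeNormBlock(nn.Module):
    """BatchNorm on the SSM input (Norm1, pre-SSM) and LayerNorm on its output (Norm2, post-SSM)."""

    def __init__(self, dim: int):
        super().__init__()
        self.pre_norm = nn.BatchNorm1d(dim)   # Norm1: conditions per-layer gradients
        self.ssm = ToySSM(dim)
        self.post_norm = nn.LayerNorm(dim)    # Norm2: suppresses activation-scale inflation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # BatchNorm1d expects (batch, channels, seq_len), so transpose around it.
        x = self.pre_norm(x.transpose(1, 2)).transpose(1, 2)
        x = self.ssm(x)
        return self.post_norm(x)


def output_proj_spectral_norm(block: CompositeNormBlock) -> float:
    """Largest singular value of the SSM output projection weights."""
    return torch.linalg.matrix_norm(block.ssm.out_proj.weight, ord=2).item()


def joint_cov_top_eig(x: torch.Tensor, y: torch.Tensor) -> float:
    """Largest eigenvalue of the covariance of concatenated (input, output) features."""
    z = torch.cat([x, y], dim=-1).reshape(-1, x.shape[-1] + y.shape[-1])
    cov = torch.cov(z.T)  # features as rows, samples as columns
    return torch.linalg.eigvalsh(cov)[-1].item()


if __name__ == "__main__":
    block = CompositeNormBlock(dim=64)
    x = torch.randn(8, 128, 64)  # (batch, seq_len, dim)
    y = block(x)
    print(y.shape, output_proj_spectral_norm(block), joint_cov_top_eig(x, y))
```

The sketch only fixes the placement question (BatchNorm before the SSM, LayerNorm after it); the K-FAC conditioning analysis and the benchmark evaluation described in the abstract are not reproduced here.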
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 609