
\section{Introduction}
Recent advances in state-space models (SSMs) have made it possible to learn long-range sequence relationships while mitigating the computational inefficiency of explicitly modeling token interactions \citep{gu2023mamba,gu2021efficiently,gu2020hippo,gu2021combining,smith2022simplified}. SSMs achieve significantly better performance than attention-based transformers on the Long Range Arena (LRA) benchmark \citep{tay2020long}.
The linear time-invariant structure of SSMs allows for an asymptotic computational complexity of only $O(T\log T)$ in the sequence length $T$, compared with the $O(T^{2})$ complexity of traditional full-attention approaches \citep{vaswani2017attention}. Moreover, SSMs have proven effective in multiple domains dealing with continuous signal data, including audio and vision tasks \citep{li2024videomamba,goel2022s,nguyen2022s4nd}.

Despite the success of SSMs across various practical applications, many theoretical questions remain unanswered. One of the most prominent concerns the expressivity of deep state-space models. Although modern SSMs \citep{smith2022simplified,gu2023mamba} comprise dozens of layers and contain millions of parameters, it is still unclear how shallow and deep SSMs really differ, and why depth is needed in state-space models. Another open question is which kinds of sequence-to-sequence relationships a deep network can handle better than a shallow one.


% Motivated by norm-based implicit regularization \citep{arora2019implicit} and techniques such as gradient norm clipping\citep{pascanu2013difficulty}, we investigate how the depth of networks affects the parameter norm from the perspective of SSM expressivity. 

% For some oscillators, such as the shifted delta function, using a linear combination of exponential functions may result in large coefficients, but expressing them with a deep SSM may reduce the required coefficient magnitudes.

% Some recent works have investigated the role of depth in deep SSMs from the perspectives of dynamical systems or optimal transport . However, these analyses typically consider depth as a continuous variable under asymptotic assumptions, which clearly do not align with practical settings.


%clustering bevior diverge or not, but not focus on expressivity. you have to frame as much possible as other people's work in the positive, but not negative. supplement others' work not have done
% A number of recent works have employed dynamical systems approaches to study token stability and feature clustering behaviors in deep models \citep{geshkovski2023emergence, vo2024demystifying}. 


A number of recent works have employed dynamical systems approaches to study token stability in deep SSMs \citep{vo2024demystifying}. \citet{geshkovski2023emergence} use a similar dynamics-based formulation to investigate feature clustering behaviors in transformers. Other studies focus on the training dynamics of deep linear networks from an optimization perspective \citep{menon2024geometry}. However, relatively little work has been devoted to understanding how depth affects the expressivity of deep SSMs. In this paper, we adopt a simple formulation that allows us to characterize important differences between multi-layer and one-layer SSMs. Our main contributions are as follows:
% In \cref{theorem 4.1}, we prove that, in the absence of norm constraints, depth and width are equivalent in terms of expressive capacity for deep linear SSMs. Moreover, we establish the minimal width required to achieve this expressivity.
\begin{itemize}
    \item In \cref{theorem 4.1}, we prove that, in the absence of norm constraints, depth and width are equivalent for approximation, in the sense that, under a fixed parameter budget, models of arbitrary depth achieve the same expressive power.
    \item In \cref{theorem 4.2}, we demonstrate that a shallow SSM with large-norm weights can be exactly represented by a deep SSM with smaller-norm weights, under explicit norm constraints on the model parameters.
    \item In \cref{theorem 4.3}, we establish upper bounds on the minimal depth required for a deep linear SSM to represent a given shallow linear SSM in the regime of constrained parameter norms, and we validate our theoretical results with numerical experiments.
\end{itemize}

% We show that shallow SSM with big norm could be represented by deep SSM with smaller norm within the framework of a min-max problem and provide the minimal required depth for this representation under parameter norm constraints.





% prove precise equivalence because previous work tell us a special case can be represent, in the general, can be represent. a width mn depth b linear SSM can be represented by .... Conversely, ... without any constraints

% however, with the constraint on the norm of weights of SSM 
% shallow SSM whose weights have large norms can be represented by a deep SSM with a smaller norm

% refer to which theorem



% model with norm/SSM with norm
\section{Related works}

\paragraph{Expressivity of State-Space Models} State-space models originate from the HiPPO matrix, which is optimal in the online function approximation sense \citep{smith2022simplified,gu2020hippo}. \citet{wang2023state} provide theoretical guarantees for the approximation of continuous sequence-to-sequence mappings using SSMs with layer-wise nonlinear activations. Furthermore, \citet{muca2024theoretical} employ tools from rough path theory to show that deep diagonal SSMs possess less expressive power than their non-diagonal counterparts.
However, these works primarily focus on function approximation by deep SSMs, whereas the present work aims to identify the simplest setting in which one can study how depth influences the expressivity of deep SSMs.

% However, these works primarily focus on function approximation and lack a systematic analysis of how model depth influences the expressive capacity of deep SSMs.


\paragraph{Dynamics in Deep State-Space Models} Several works focus on dynamics in deep SSMs. \citet{vo2024demystifying} investigate the divergence behavior of tokens in a pre-trained Mamba model by characterizing continuous-time systems. \citet{smekal2024interplay} show how the memory of a deep linear SSM varies with depth and width, and how the learning dynamics of deep linear models vary with memory expressivity. Our work focuses on a different setting and complements existing analyses by providing a theoretical understanding of depth in deep linear SSMs.

% introduction : investigate control viewpoint
% different setting, not token dynamics but representability

% why the work is different but not others' work good or not.

\paragraph{Deep Linear Networks} Simple linear architectures often serve as effective tools for gaining theoretical insights into the behavior of deep neural networks. \citet{arora2018convergence} prove that gradient descent converges at a linear rate when training a deep linear neural network on whitened data under certain conditions. \citet{bah2022learning} show that optimizing a deep linear network is equivalent to a Riemannian gradient flow on a manifold of low-rank matrices, with a suitable Riemannian metric. \citet{gruber2024role} study the implicit bias arising from weight initialization in deep linear networks, offering the insight that weight norms are central to understanding the behavior of deep models. Following prior theoretical analysis of deep linear networks, our work focuses on understanding the expressivity of deep recurrent models, specifically deep state-space models.


% Inspired by prior theoretical analysis of deep linear networks\citep{saxe2019mathematical,saxe2013exact}, our work focuses on understanding the expressivity of deep recurrent models, specifically deep state-space models.



% Reducing the norm of parameters in machine learning, such as $L_{1}$ and $L_{2}$ regularization [], is often associated with improved robustness, and better performance. Specifically, controlling the norm of recurrent weight matrices using techniques like gradient clipping[] or spectral normalization[] is crucial for mitigating the exploding gradient problem[] in recurrent neural network. However, there has been limited investigation into how parameter norms affect deep sequential models and how their impact interacts with model depth until now.


% deep linear model application on recurrent deep model

% \paragraph{norm base implicit bias}
% norm optimization, expressivity norm constraint different, evidence shows trained model prefer small norm base implicit regularization


% support the idea we prefer norm is small   usual case: we applied gradient clipping or norm clipping,   depth and shallow are different.

