\section{Problem Formulation}


One particularly notable feature of currently popular sequential models such as S4~\citep{smith2022simplified} and Mamba~\citep{gu2023mamba} is the stacking of multiple layers, which increases the number of parameters and, in line with scaling laws, yields stronger performance. In this section, we provide a theoretical perspective on the architecture of deep state-space models (SSMs) in order to characterize more precisely how depth contributes to their success.

\subsection{Deep Linear State-Space Models}
% Here, $g_{t}^{(l)}$ is the hidden state of $l$-th layer at time step $t$ and function $f$ means recurrent mapping of each layer from previous layer output $g_{t+1}^{(l-1)}$ to hidden state of present layer $g_{t+1}^{(l)}$.

% directly write (2) general form add activation. from now, sigma=identity 
% deep SSM, linear : remove sigma to 1

% We observe that both shallow and deep state-space models share a convolutional structure, differing only in their kernels when the nonlinearity in the function $f$ is removed. This observation motivates our focus on the linear setting, which serves as a natural and tractable starting for investigating the role of depth in SSMs.

For a general $l$-layer deep state-space model, we express its recurrent form as follows, where each layer's input comes from the previous layer's output:
\begin{equation}
\label{linear deep ssm}
\begin{aligned}
y(t) &= C^{T}\sigma(h_{l}(t))\\
h_{l}(t+1) &= A_{l}h_{l}(t)+B_{l}\sigma(h_{l-1}(t+1))\\
&\vdots\\
h_{2}(t+1) &= A_{2}h_{2}(t)+B_{2}\sigma(h_{1}(t+1))\\
h_{1}(t+1) &= A_{1}h_{1}(t)+B_{1}x(t+1)\\
\end{aligned}
\end{equation}
where $h_{k}(t)\in\mathbb{R}^{m}$ is the hidden state of the $k$-th layer at time step $t$ for $k=1,\dots,l$, and $\sigma:\mathbb{R}^{m\times 1}\to\mathbb{R}^{m\times1}$ is the activation function applied to the hidden state. The matrices $A_{1},\dots,A_{l}\in\mathbb{C}^{m\times m}$ are the state-space (hidden) matrices, $B_{1}\in\mathbb{C}^{m\times1}$ and $B_{2},\dots,B_{l}\in\mathbb{C}^{m\times m}$ are the input matrices, and $C\in\mathbb{C}^{m\times1}$ is the output matrix. For each scalar input $x(t)\in\mathbb{R}$, the $l$-layer model produces a scalar output $y(t)\in\mathbb{R}$.



% convolved by kernel $\rho(t)$, i.e. $y(t)=(\rho \ast x) (t)$.

We now set $\sigma$ to be the identity mapping and consider the linear setting. Once the nonlinearity $\sigma$ is removed, shallow and deep state-space models share the same convolutional structure and differ only in their kernels. The linear setting therefore serves as a natural and tractable starting point for investigating the role of depth in SSMs.
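
As a brief sketch of this convolutional structure in the one-layer case, assume (for illustration only) a zero initial state $h_{1}(0)=0$ and an input that starts at $t=1$. Unrolling the recurrence in \cref{linear deep ssm} with $l=1$ and $\sigma$ the identity gives
\begin{equation*}
\begin{aligned}
h_{1}(t) &= \sum_{s=1}^{t}A_{1}^{t-s}B_{1}x(s),\\
y(t) &= C^{T}h_{1}(t)=\sum_{i=0}^{t-1}\rho(i)\,x(t-i),\qquad \rho(i)=C^{T}A_{1}^{i}B_{1},
\end{aligned}
\end{equation*}
so the output is the input convolved with a kernel $\rho$ that depends only on $(A_{1},B_{1},C)$.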


The convolutional kernel $\rho$ describes exactly how the sequential model maps an input to the corresponding output, namely $y(t)=(\rho \ast x) (t)$. We characterize the convolutional kernel of the deep state-space model as follows:

\begin{lemma} \label{lemma: convolution kernel}
     For the $l$-layer deep linear SSM defined in \cref{linear deep ssm}, the convolutional kernel $\rho(t)$ admits the representation
\begin{equation}
\rho(t)=\sum_{\substack{i_{1}+i_{2}+\dots+i_{l}=t\\i_{1},\dots,i_{l}\in\mathbb{N}}}C^{T}\prod_{j=1}^{l}\left(A_{l-j+1}^{i_{l-j+1}}B_{l-j+1}\right),
\end{equation}
where $\mathbb{N}$ includes $0$ and the factors of the product are ordered from left to right as $j=1,\dots,l$.
\end{lemma}

A proof of \cref{lemma: convolution kernel} can be found in \cref{ProofOfStructure}. The proof checks the base case and then proceeds by induction on both the time step $t$ and the number of layers $l$.

Notably, it is the linearity of the architecture that allows the convolutional kernel to be expressed in a clear and tractable form: a sum, over all ways of distributing the $t$ time steps among the $l$ layers, of products of powers of the hidden matrices. The lemma thus makes explicit the kernel with which a deep SSM convolves the input sequence $x$ to produce the output $y(t)$. This lays the groundwork for our subsequent comparison of the representational capacities of deep and shallow SSMs.
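
To make the formula concrete, consider the case $l=2$ with diagonal $A_{1}=\mathrm{Diag}\{a_{1},\dots,a_{m}\}$ and $A_{2}=\mathrm{Diag}\{b_{1},\dots,b_{m}\}$. The entry names $a_{i}$ and $b_{j}$ are notation introduced here only for illustration, and we read $\mathbb{N}$ as including $0$. Under these assumptions, \cref{lemma: convolution kernel} gives
\begin{equation*}
\rho(t)=\sum_{s=0}^{t}C^{T}A_{2}^{t-s}B_{2}A_{1}^{s}B_{1}
=\sum_{j=1}^{m}\sum_{i=1}^{m}C_{j}\,(B_{2})_{ji}\,(B_{1})_{i}\sum_{s=0}^{t}b_{j}^{t-s}a_{i}^{s},
\end{equation*}
so each pair of modes $(a_{i},b_{j})$ contributes a discrete convolution of two geometric sequences, weighted by a product of three parameter entries.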

\subsection{Norm-Constrained Hypothesis Space for Deep Linear SSMs}
\label{space definition}
Leveraging \cref{lemma: convolution kernel}, we describe the function space of deep linear state-space models through the following norm-constrained hypothesis space:
\begin{equation}
\label{4}
\begin{aligned}
    \mathcal{H}_{c,l}^{m}=&\{\rho(t):y(t)=(\rho \ast x) (t),\ A_{1},\dots,A_{l}\in\mathbb{C}^{m\times m}\ \text{diagonal},\ B_{2},\dots,B_{l}\in\mathbb{C}^{m\times m},\\ &C,B_{1}\in\mathbb{C}^{m\times1},\ \max_{i=1,\dots,l}r(A_{i})<1,\ \|C\|_{\infty}\le c,\ \|B_{1}\|_{\infty}\le c,\ \max_{2\le k\le l}\max_{1\le i,j\le m}|(B_{k})_{ij}|\le c\}
\end{aligned}
\end{equation}
where $r(\cdot)$ denotes the spectral radius of a matrix, $\|\cdot\|_{\infty}$ denotes the infinity norm of a vector, $m$ is the width of the network, $c$ is the norm constraint on the parameters, and $l$ is the number of layers.

Here, we assume that each $A_i$ is diagonal, a structure commonly adopted in real-world models \citep{saon2023diagonal}. We further require the spectral radius of each $A_i$ to be strictly less than $1$ to ensure system stability. Since the implementation in \citet{smith2022simplified} is one-dimensional, we assume that both the input $x(t)$ and the output $y(t)$ are scalar-valued. Norm constraints are natural from an optimization standpoint: state-space models may suffer from large parameter norms, which can lead to training instability and degrade model performance \citep{pascanu2013difficulty}. Moreover, the convolutional kernel is directly determined by the parameters $B_{i}$ and $C$, so the norm constraint $c$ is an important factor in distinguishing the expressivity of deep state-space models, and altering it can have a significant impact.

%we first consider the most general hypothesis space, which consists of all SSMs where recurrent matrices Ai are diagnolizable and has negative real parts these coresspond to stable models spectral radius. next we will introduce a simplied version, a subset of these hypothesis whereAi diagonal these are also commanly used S4Dand for this , we obtain better results,. S4 implementation one dimension to one dimension


%notation: infinity norm defines this max infinity 
% We observe that the only difference between $\mathcal{H}_{c,n}^{m}$ and $\mathcal{G}_{c,n}^{m}$ which are controlled by width $m$, norm constraint $c$ and layer $n$, lies in whether hidden matrices $A_{i}$ are diagonal or diagonalizable. It is clear that $\mathcal{H}_{c,n}^{m}$ encompasses a larger space. However, the diagonal setting of $\mathcal{G}_{c,n}^{m}$ is more tractable for our subsequent analysis.

% The difference between $\mathcal{H}_{c,n}^{m}$ and $\mathcal{G}_{c,n}^{m}$ is not as substantial as one might initially expect. In fact, if we remove the norm constraint-that is, as $c$ approaches infinity-the two hypothesis spaces become identical.
% \begin{lemma}[Representability equivalence in two hypothesis space]
% \label{lemma: equaivalence}
% Given $m,n\ge1$, the we have 
% \begin{equation}
%     \mathcal{G}_{\infty,n}^{m}=\mathcal{H}_{\infty,n}^{m}
% \end{equation}
% \end{lemma}

% %proposition 3.2
% A proof of \cref{lemma: equaivalence} is provided in \cref{ProofOfStructure}. By leveraging the property of diagonalizability, we can straightforwardly map each $\rho_{t}\in\mathcal{H}_{\infty,n}^{m}$ to its equivalent representation in $\mathcal{G}_{\infty,n}^{m}$

\section{Main Results}
In this section, we present our main results on the expressivity of deep state-space models within the framework defined in \cref{space definition}.

%depth and with are equivalent when c goes to infinity
%layer and width are equivalent for norm unconstrained / when norm is unconstrained
\subsection{Equivalence of Depth and Width without Norm Constraints}
\label{subsection 4.1}
First, we focus on the hypothesis space of deep linear SSMs without norm constraints. Our goal is to characterize the fundamental differences between shallow and deep SSMs, which leads to the following question: given a one-layer SSM with a certain width,
how wide must an $l$-layer SSM be in order to represent it? In previous work, \citet{smekal2024interplay} provided an example demonstrating how a four-layer linear SSM can be converted into a single-layer one. Here, we give a complete characterization.

% Given a one-layer linear SSM with width 
% $nm$, how wide must an $n$-layer SSM be in order to represent it? Conversely, given an $n$-layer linear SSM with width $m$, how wide must a one-layer linear SSM be to represent it? Based on \cref{lemma: equaivalence}, we choose to focus on hypothesis space (4) in subsequent analysis. We use the following theorem to address these interesting questions. 

%given a one-layer SSM withcertain width, how wide musy an n layer SSM be in order to represent it. in previous poaper on the interplay... they have given a version of diagonal ... here we give a complete results

\begin{theorem} 
\label{theorem 4.1}
Let $m,l\geq 1$. Recall that $\mathcal{H}_{\infty,l}^{m}$ is the norm-unconstrained hypothesis space of linear SSMs with $l$ layers and width $m$. Then, we have
\begin{equation}
\begin{aligned}
    &\mathcal{H}_{\infty,1}^{l(m-1)+1}\subseteq\mathcal{H}_{\infty,l}^{m}\subseteq\mathcal{H}_{\infty,1}^{lm}\\
    &\mathcal{H}_{\infty,1}^{l(m-1)+2}\not\subseteq\mathcal{H}_{\infty,l}^{m}
\end{aligned}
\end{equation}
\end{theorem}

%huanhuang
%proof technique : base on fourier transform/ constructive method the proof is by specific construction

%conclusion order nm






A detailed proof of \cref{theorem 4.1} based on explicit construction can be found in \cref{ProofMain}. 
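
For instance, taking $l=2$ and $m=4$ (values chosen here purely for illustration), we have $l(m-1)+1=7$, $lm=8$ and $l(m-1)+2=8$, so that \cref{theorem 4.1} reads
\begin{equation*}
\mathcal{H}_{\infty,1}^{7}\subseteq\mathcal{H}_{\infty,2}^{4}\subseteq\mathcal{H}_{\infty,1}^{8},\qquad \mathcal{H}_{\infty,1}^{8}\not\subseteq\mathcal{H}_{\infty,2}^{4}.
\end{equation*}
In this instance the second inclusion is therefore strict.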

On the one hand, \cref{theorem 4.1} shows that given an $l$-layer linear SSM with width $m$, we can always construct a one-layer linear SSM with width $ml$ to represent it.
% On the other hand, according to \cref{theorem 4.1}, given a one-layer linear SSM, if we wish to represent it using an $l$-layer SSM with width $m$, the maximal width is $l(m-1)+1$. 
On the other hand, the maximal width of a one-layer SSM that is guaranteed to be representable by an $l$-layer SSM with width $m$ is $l(m-1)+1$: there exists a convolutional kernel of a one-layer linear SSM with width $l(m-1)+2$ that cannot be represented by the kernel of any $l$-layer SSM with width $m$. We construct such a kernel to illustrate the optimality of this width bound, as detailed in \cref{ProofMain}. This result shows that an $l$-layer SSM of width $m$ has expressivity equivalent to that of a one-layer SSM of width $O(lm)$ under the same parameter count, highlighting that width can be traded for depth without loss of expressive power.
\par In fact, a more general result holds when the assumption that each hidden matrix $A_i$ is diagonal is relaxed to $A_i$ being diagonalizable; the diagonalizable matrices form a dense subset of the matrix space $\mathbb{C}^{m\times m}$. See \cref{ProofMain} for details.
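
A heuristic way to see why width $lm$ suffices, sketched here for $l=2$ and not intended as a substitute for the proof, is to stack the two hidden states. Substituting the first-layer update into the second-layer update in \cref{linear deep ssm} (with $\sigma$ the identity) gives
\begin{equation*}
\begin{pmatrix} h_{2}(t+1)\\ h_{1}(t+1)\end{pmatrix}
=\begin{pmatrix} A_{2} & B_{2}A_{1}\\ 0 & A_{1}\end{pmatrix}
\begin{pmatrix} h_{2}(t)\\ h_{1}(t)\end{pmatrix}
+\begin{pmatrix} B_{2}B_{1}\\ B_{1}\end{pmatrix}x(t+1),
\qquad
y(t)=\begin{pmatrix} C^{T} & 0\end{pmatrix}\begin{pmatrix} h_{2}(t)\\ h_{1}(t)\end{pmatrix}.
\end{equation*}
This is a one-layer recurrence with a state of dimension $2m$. Its state matrix is block triangular rather than diagonal, so a further diagonalization step is still needed to obtain a member of $\mathcal{H}_{\infty,1}^{2m}$; this is possible, for example, when the $2m$ diagonal entries of $A_{1}$ and $A_{2}$ are pairwise distinct.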

% In fact, we can directly obtain this one-layer representation from the recurrent form in \cref{linear deep ssm}:
% \begin{equation}
%     \begin{pmatrix}
%         g_{t+1}^{(n)}\\
%         g_{t+1}^{(n-1)}\\
%         \vdots\\
%         g_{t+1}^{(1)}
%     \end{pmatrix}
%     =\begin{pmatrix}
%         A_{n}&B_{n}A_{n-1}&\cdots&B_{n}B_{n-1}\cdots B_{2}A_{1}\\
%         &\ddots& &\vdots\\
%         & &\ddots&B_{2}A_{1}\\
%         &&&A_{1}
%     \end{pmatrix}
%     \begin{pmatrix}
%         g_{t}^{(n)}\\
%         g_{t}^{(n-1)}\\
%         \vdots\\
%         g_{t}^{(1)}
%     \end{pmatrix}+
%     \begin{pmatrix}
%         B_{n}B_{n-1}\cdots B_{1}\\
%         B_{n-1}\cdots B_{1}\\
%         \vdots\\
%         B_{1}
%     \end{pmatrix}x_{t}
% \end{equation}
% On the other hand, according to \cref{theorem 4.1}, given a one-layer linear SSM, if we wish to represent it using an $n$-layer SSM with width $m$, the minimal width is $n(m-1)+1$. This implies that there exists a convolutional kernel of a one-layer linear SSM with width $n(m-1)+2$ that cannot be represented by the kernel of an $n$-layer SSM with width $m$. We have constructed such a kernel to illustrate the optimality of this width bound, as detailed in the Appendix.



%width and depth are not equivalent under norm constraint
% for c infinity setting, increasing width and increasing depth are equivalent. however under norm constraint, the story changes. in practice, our optimization, our weight norm 
\subsection{Non-equivalence of Depth and Width with Norm Constraints}
\label{section 4.2}
In the absence of norm constraints, increasing width and depth are equivalent under the same parameter count (\cref{theorem 4.1}). However, under norm constraints, the effects of depth and width on the expressivity of deep linear SSMs differ significantly. Now, we focus on the hypothesis space defined in \cref{space definition} to investigate the impact of norm constraints on the expressivity of deep linear state-space models. The following theorem demonstrates how a shallow SSM with a large parameter norm can be equivalently represented by a deeper SSM composed of layers with smaller parameter norms:

% Now we return from the infinite norm setting discussed in \cref{subsection 4.1} and %for rela setting, 2/n can become 1/n real and complex setting
\begin{theorem}
\label{theorem 4.2} Suppose $m,l\ge1$ and let $c_{1}>0$ be the norm constraint for the one-layer linear SSM. Then, we have the following upper bound on the norm constraint of the $l$-layer linear SSM defined in \cref{linear deep ssm}:
    \begin{equation}
    \sup_{\rho\in\mathcal{H}_{c_{1},1}^{l(m-1)+1}}\inf_{c_{2}>0}\left\{c_{2}:\ \rho\in\mathcal{H}_{c_{2},l}^{m}\right\}\le2c_{1}^{\frac{2}{l+1}}
\end{equation}
\end{theorem}

A detailed proof of \cref{theorem 4.2} can be found in \cref{ProofMain}. Notably, at the same order of magnitude of parameter count, a one-layer linear SSM with a large norm constraint $c_{1}$ can be equivalently represented by an $l$-layer SSM with width $m$ and a smaller norm constraint bounded by $2c_{1}^{\frac{2}{l+1}}$. For $c_{1}>1$, this bound decreases monotonically in the number of layers $l$ and approaches $2$, indicating that depth plays an important role in reducing the norm from the approximation perspective.
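
To get a feel for the rate, the following evaluates the bound $2c_{1}^{2/(l+1)}$ for the illustrative value $c_{1}=10^{3}$ (a number chosen here, not taken from the experiments):
\begin{equation*}
\begin{array}{c|cccc}
l & 1 & 2 & 5 & 11\\ \hline
2c_{1}^{2/(l+1)} & 2000 & 200 & 20 & 2\sqrt{10}\approx 6.32
\end{array}
\end{equation*}
In this instance, two layers already reduce the admissible weight norm by a factor of ten relative to the one-layer model.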




%the proof is again byconstruction. the following we give an example to show...  here we give an example shows construction.




%central message: why 2/3 power? need to explain the max norm heere is Z0 by construction heuristic: the big norm split to small norms multiply: norm large >1, split into small norm multiply; norm small <1 split into a little large norm multiply this is the intuition
Here we give an example of the construction, converting a given $1$-layer linear SSM with width $2m-1 = 7$ into a $2$-layer linear SSM with width $m=4$. Let $\alpha_1, \dots, \alpha_7$ be distinct non-zero complex numbers ordered so that $|\alpha_1| \le \dots \le |\alpha_7| < 1$, and suppose that the $1$-layer SSM is defined by $B = C = (z_1, \dots, z_7)^T$ and $A = \mathrm{Diag}\{\alpha_1, \dots, \alpha_7\}$. With $c_0 = \max_{1 \le i \le 7} |z_i|$, the kernel $\rho$ defined by this SSM satisfies $\rho \in \mathcal{H}_{c_{0},1}^{7}$ and $\rho(t) = \sum_{i=1}^7 z_i^2 \alpha_i^t$.
\par We now construct a $2$-layer SSM with the same $\rho$. Let $Z_0 = 2 c_0^{\frac{2}{3}}$. The following $A_1, A_2, B_1, B_2, C$ define the corresponding $2$-layer SSM of width $4$.
\begin{itemize}
    \item $A_1 = \mathrm{Diag} \{\alpha_1, \alpha_2, \alpha_3, \alpha_4\}$.
    \item $A_2 = \mathrm{Diag}\{\alpha_5, \alpha_6, \alpha_{7}, 0\}$.
    \item $B_1 = C = (Z_0, Z_0, Z_0, Z_0)^T$.
    \item $B_2 = \begin{pmatrix}
 \frac{(\alpha_5 - \alpha_1)z_5^2}{\alpha_5 Z_0^2} & 0 &  0 & 0\\
 0 & \frac{(\alpha_6 - \alpha_2)z_6^2}{\alpha_6 Z_0^2} &   0& 0\\
 0 & 0 &  \frac{(\alpha_7 - \alpha_3)z_7^2}{\alpha_7 Z_0^2} & 0\\
\frac{(z_{1}^2) + (z_{5}^2)\frac{\alpha_1}{\alpha_5}}{Z_0^2}  & \frac{(z_{2}^2) + (z_{6}^2)\frac{\alpha_2}{\alpha_6}}{Z_0^2} &  \frac{(z_3^2) + (z_7^2)\frac{\alpha_{3}}{\alpha_7}}{Z_0^2} & \frac{z_{4}^2}{Z_0^2}
\end{pmatrix}$
\end{itemize} 
Computing the convolutional kernel $\hat{\rho}$ of the above $2$-layer SSM, with the convention $0^{0}=1$ for the zero diagonal entry of $A_2$, we obtain
\begin{align}
    \hat{\rho}(t) &= z_4^2 \alpha_4^t + \sum_{i = 1}^3 \left(\left[z_{i}^2 + z_{i+4}^2\frac{\alpha_i}{\alpha_{i+4}}\right] \alpha_{i}^t +  \frac{(\alpha_{i+4} - \alpha_i)z_{i+4}^2}{\alpha_{i+4}} \sum_{s = 0}^t \alpha_i^{s} \alpha_{i+4}^{t-s}\right)\\
    &= z_4^2 \alpha_4^t + \sum_{i = 1}^3 \left(\left[z_{i}^2 + z_{i+4}^2\frac{\alpha_i}{\alpha_{i+4}}\right]\alpha_{i}^t +  \frac{(\alpha_{i+4} - \alpha_i)z_{i+4}^2}{\alpha_{i+4}} \cdot \frac{\alpha_{i+4}^{t+1} - \alpha_i^{t+1}}{\alpha_{i+4} - \alpha_i}\right)\\
    &= z_4^2 \alpha_4^t + \sum_{i = 1}^3 \left(\left[z_{i}^2 + z_{i+4}^2\frac{\alpha_i}{\alpha_{i+4}} - z_{i+4}^2\frac{\alpha_i}{\alpha_{i+4}}\right]\alpha_{i}^t +  \frac{z_{i+4}^2 \alpha_{i+4}^{t+1}}{\alpha_{i+4}}\right)\\
    &= z_4^2 \alpha_4^t + \sum_{i = 1}^3 \left(z_{i}^2\alpha_{i}^t + z_{i+4}^2 \alpha_{i+4}^{t}\right)\\
    &= \rho(t)
\end{align}
The above is an explicit construction converting a $1$-layer linear SSM with width $7$ into a $2$-layer linear SSM with width $4$ under the same parameter count with norm constraints. In the general case of converting into an $l$-layer SSM, the positions of the non-zero elements of $B_2, \dots, B_l$ are the same as those of $B_2$ above.
\par Since $|\alpha_i| \le |\alpha_{i+4}|$, all the entries in the $m$-th row of $B_2$ are bounded by $\frac{\max_{i} \{|z_i^2| + |z_{i+4}^2|\}}{Z_0^2} \le Z_0 = 2 c_0^{\frac{2}{3}}$, matching \cref{theorem 4.2}. The core idea is as follows: if the $1$-layer SSM has weight norm $c_0$, so that its kernel coefficients are of order $c_0^{2}$, the $l$-layer SSM factors each coefficient into a product of $l+1$ weights, each of which can have norm of order $O(c_0^{\frac{2}{l+1}})$. A similar argument holds for the case $c_0 < 1$, in which the norms of the weights of the deep SSM can be increased.
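
The entrywise bounds for this example can be written out as follows. This is a sketch of the arithmetic only, using $|z_{i}|\le c_{0}$, $|\alpha_{i}|\le|\alpha_{i+4}|$ and $Z_{0}^{2}=4c_{0}^{4/3}$. For $i=1,2,3$,
\begin{align*}
\left|\frac{z_{i}^{2}+z_{i+4}^{2}\frac{\alpha_{i}}{\alpha_{i+4}}}{Z_{0}^{2}}\right| &\le \frac{|z_{i}|^{2}+|z_{i+4}|^{2}}{Z_{0}^{2}}\le\frac{2c_{0}^{2}}{4c_{0}^{4/3}}=\frac{1}{2}c_{0}^{2/3},\\
\left|\frac{(\alpha_{i+4}-\alpha_{i})z_{i+4}^{2}}{\alpha_{i+4}Z_{0}^{2}}\right| &\le \frac{2|z_{i+4}|^{2}}{Z_{0}^{2}}\le\frac{2c_{0}^{2}}{4c_{0}^{4/3}}=\frac{1}{2}c_{0}^{2/3},
\end{align*}
where the second line uses $|\alpha_{i+4}-\alpha_{i}|\le 2|\alpha_{i+4}|$, and similarly $|z_{4}^{2}|/Z_{0}^{2}\le\frac{1}{4}c_{0}^{2/3}$. Every entry of $B_{2}$, and trivially every entry of $B_{1}$ and $C$, is thus at most $Z_{0}$ in modulus.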

% A detailed proof of \cref{theorem 4.2} can be found in Appendix. Notably, a given one-layer linear SSM with a large norm constraint 
% $c_{1}$ can be equivalently represented by an $n$-layer SSM with width $m$ and a smaller norm constraint bounded by $2c_{1}^{\frac{2}{n+1}}$. As we can see, as the number of layers increases, the corresponding norm decreases, indicating that depth plays an important role in reducing the norm from approximation perspective.


\subsection{Minimal Depth for Representing Shallow Networks}

In \cref{section 4.2}, we highlighted the important role of depth in enabling norm reduction in deep linear state-space models. In this section, we address a more refined question: given an SSM whose weight norms are bounded by $c_1$, which may be very large, how deep must a norm-constrained deep linear SSM with norm bound $c_2$ be to achieve the same expressive capacity? Very large parameter norms may arise when SSMs are used to learn nonsmooth or highly oscillatory memory functions~\citep{pascanu2013difficulty}, while $c_2$ could be predetermined by stability constraints imposed by the optimizer. \cref{theorem 4.2} already indicates that substantial norm reduction is possible through increased depth. We now formalize this by deriving an upper bound on the minimal depth required for a deep linear SSM with norm bound $c_2$ to represent a given one-layer linear SSM constrained by $c_1$.

\begin{theorem}
\label{theorem 4.3}
Suppose $\rho \in \mathcal{H}_{c_1,1}^{K+1}$ for some $K \geq 2$, and let $c_1 > 1$ and $c_2 > 2$ denote the norm constraints for the one-layer and the deep linear SSM, respectively. Then the minimal depth of a deep linear SSM, as defined in \cref{linear deep ssm}, that represents $\rho$ satisfies
\begin{equation}
\min\left\{l:\ \rho\in\mathcal{H}_{c_{2},l}^{\lceil \frac{K}{l} \rceil +1}\right\} \le \left\lceil\frac{2\ln(c_{1})}{\ln\left(\frac{c_{2}}{2}\right)}-1\right\rceil
\end{equation}
\end{theorem}
A detailed proof of \cref{theorem 4.3} can be found in \cref{ProofMain}.
The technique employed to determine the minimal required depth closely follows the constructive approach developed in \cref{theorem 4.2}. 
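
The form of the bound can be read off from \cref{theorem 4.2}; the following is a heuristic sketch that ignores the rounding of the width. Requiring the norm bound of \cref{theorem 4.2} to be at most $c_{2}$ gives
\begin{equation*}
2c_{1}^{\frac{2}{l+1}}\le c_{2}
\iff \frac{2}{l+1}\ln(c_{1})\le\ln\left(\frac{c_{2}}{2}\right)
\iff l\ge\frac{2\ln(c_{1})}{\ln\left(\frac{c_{2}}{2}\right)}-1,
\end{equation*}
where $c_{1}>1$ and $c_{2}>2$ ensure that both logarithms are positive. As an illustrative instance with values chosen here, $c_{1}=10^{3}$ and $c_{2}=20$ give $\frac{2\ln(10^{3})}{\ln(10)}-1=5$, so depth $5$ suffices.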
% diagnolaizable has conditional number har to deal with
%1. diagonal iinear SSM H

To maintain a constant parameter count, we set the width of the $l$-layer linear SSM to $\lceil \frac{K}{l} \rceil + 1$. If the total number of parameters were allowed to increase, the problem would become trivial, since parameter norms can be reduced simply by allocating more parameters. \cref{theorem 4.3} indicates that increasing depth is an efficient strategy even under a fixed parameter budget, which suggests that model performance may be improved by increasing depth when the parameter norm is large. We examine this in our experiments in \cref{experiment}.





% Here in order to still maintain the same parameter count, we choose the width for the $l$ layer SSM be $\lceil \frac{K}{l} \rceil +1$. If parameter count can be different, then of course the problem is trivial because we can decrease norms of weights by increasing parameter counts. The log results reveals that it is efficient to increase layers. this shows for example, we might be able to improve model performance by increasing the depth even at a constant parameetr count if the parameetr norm is large. We will show later in our experiments.

\subsection{Beyond the Diagonal Case}
Recall that \cref{theorem 4.1} extends to the case where the hidden matrices are diagonalizable. However, extending \cref{theorem 4.2} to diagonalizable matrices remains challenging, because the resulting bounds involve the condition number of the diagonalizing matrix. To obtain tractable norm bounds, we instead assume that each hidden matrix $A_i$ is normal, i.e., unitarily diagonalizable, so that this condition number equals $1$.
This assumption also aligns with the HiPPO initialization commonly used in practice \citep{gu2020hippo}. Now we define the following hypothesis space:  
  % hippo intinliazton orthogonal
\begin{equation}
\begin{aligned}
  \mathcal{G}_{c,l}^{m}=&\{\rho(t):y(t)=(\rho\ast x)(t),\ A_{1},\dots,A_{l}\in\mathbb{C}^{m\times m}\ \text{normal},\ B_{2},\dots,B_{l}\in\mathbb{C}^{m\times m},\\ &C,B_{1}\in\mathbb{C}^{m\times1},\ \max_{i=1,\dots,l}r(A_{i})<1,\ \|C\|_{\infty}\le c,\ \|B_{1}\|_{\infty}\le c,\ \max_{2\le k\le l}\max_{1\le i,j\le m}|(B_{k})_{ij}|\le c\}
\end{aligned}
\end{equation}

The following result generalizes \cref{theorem 4.2} to the normal case.
% We now employ $\mathcal{G}_{c,l}^{m}$ to describe the upper bound for depth on the norm constraint of a $l$-layer SSM with width $m$ that is required to represent a given one-layer linear SSM with a large norm constraint $c_{1}$.
\begin{corollary}
Suppose $m,l\ge1$ and let $c_{1}>0$ be the norm constraint for the one-layer linear SSM. Then, we have the following upper bound on the norm constraint of the $l$-layer linear SSM in which each hidden matrix $A_{i}$ is normal:
\label{Hermite property}
    \begin{equation}
    \max_{\rho\in\mathcal{G}_{c_{1},1}^{l(m-1)+1}}\min_{c_{2}>0}\left\{c_{2}:\ \rho\in\mathcal{G}_{c_{2},l}^{m}\right\}\le2\left((l(m-1)+1)c_{1}^{2}\right)^{\frac{1}{l+1}}
\end{equation}
\end{corollary} 

A detailed proof of \cref{Hermite property}, as a generalization of \cref{theorem 4.2}, can be found in \cref{ProofMain}. Unlike the upper bound in \cref{theorem 4.2}, which depends only on the number of layers $l$, this bound also depends explicitly on the width $m$, which reveals a difference between deep and shallow networks in their expressive ability over this larger space.
% For fixed number of layers $l$, and suppose the width $l(m-1)+1$ of the $1$-layer SSM and the width $m$ of the $l$-layer SSM are increasing together with $m$. Then 
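More precisely, for a fixed number of layers $l$, the bound in \cref{theorem 4.2} does not change with the width, whereas the bound for this larger space grows at the rate $m^{\frac{1}{l+1}}$. This factor remains bounded by a constant once the depth grows at least logarithmically in the width, i.e., $l = \Omega(\log(m))$.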
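
The constant bound for logarithmic depth can be checked directly; the following short calculation is a sketch under the assumption $l+1\ge\ln m$. Since $l(m-1)+1\le lm$ and $\ln l\le l+1$ for $l\ge1$,
\begin{equation*}
(l(m-1)+1)^{\frac{1}{l+1}}\le l^{\frac{1}{l+1}}\,m^{\frac{1}{l+1}}
=\exp\left(\frac{\ln l}{l+1}\right)\exp\left(\frac{\ln m}{l+1}\right)\le e\cdot e=e^{2},
\end{equation*}
so the bound in \cref{Hermite property} is at most $2e^{2}c_{1}^{\frac{2}{l+1}}$ in this regime.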
%Under the assumption that each hidden matrix is normal, norm constraints  may require an asymptotically exponential reduction in parameter norms as depth increases. However, for smaller model sizes, the bound may be loose due to the presence of multiplicative terms that depend on both the width m and the depth l.
