
\section{Experiments}
\label{experiment}
In this section, we validate our theoretical results through numerical experiments. Before presenting our results, we state \cref{expansion}, which facilitates computing the output coefficients of a deep linear SSM as described in \cref{linear deep ssm}, i.e., its direct expansion into the equivalent one-layer SSM form.
\begin{lemma}
\label{expansion}
    Suppose that in \cref{linear deep ssm}, $A_{i}=\operatorname{diag}(\lambda_{i,1},\dots,\lambda_{i,m})$, where the eigenvalues are all distinct. We denote the $(i,j)$ element of the matrix $B_{k}$ by $b_{ij}^{(k)}=(B_{k})_{ij}$, and write $C^{T}=(c_{1},\cdots,c_{m})^{T}$ and $B_{1}=(b_{1},\cdots,b_{m})^{T}$. Then, fixing $t_{l-\tilde{l}+1} = \tilde{m}$, the coefficient of the term corresponding to the $\tilde{m}$-th eigenvalue of $A_{\tilde{l}}$, i.e., $\lambda_{\tilde{l},\tilde{m}}^{t}$, in the output expansion is
\begin{equation}
    \xi_{\tilde{l},\tilde{m}}=\sum_{t_{1}=1}^{m}\cdots\sum_{t_{l-\tilde{l}}=1}^{m}\sum_{t_{l-\tilde{l}+2}=1}^{m}\cdots\sum_{t_{l}=1}^{m}\frac{c_{t_{1}}b_{t_{1}t_{2}}^{(l)}b_{t_{2}t_{3}}^{(l-1)}\cdots b_{t_{l-1}t_{l}}^{(2)}b_{t_{l}}}{\prod_{\alpha=1,\alpha\neq \tilde{l}}^{l}\left(1-\frac{\lambda_{\alpha,t_{l-\alpha+1}}}{\lambda_{\tilde{l},t_{l-\tilde{l}+1}}}\right)}.
\end{equation}
\end{lemma}

A detailed proof can be found in \cref{ProofMain}.
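
For illustration, the formula in \cref{expansion} can be specialized to $l=2$, where a single free summation index remains in each case. This is only a direct reading of the stated expression, not an additional result:
\begin{align*}
    \xi_{1,\tilde{m}} &= \sum_{t_{1}=1}^{m}\frac{c_{t_{1}}\,b_{t_{1}\tilde{m}}^{(2)}\,b_{\tilde{m}}}{1-\lambda_{2,t_{1}}/\lambda_{1,\tilde{m}}}, &
    \xi_{2,\tilde{m}} &= \sum_{t_{2}=1}^{m}\frac{c_{\tilde{m}}\,b_{\tilde{m}t_{2}}^{(2)}\,b_{t_{2}}}{1-\lambda_{1,t_{2}}/\lambda_{2,\tilde{m}}}.
\end{align*}
The denominators have the same form as in the elementary identity for the convolution of two geometric sequences with nonzero $\lambda_{1}\neq\lambda_{2}$,
\begin{equation*}
    \sum_{s=0}^{t}\lambda_{1}^{s}\lambda_{2}^{t-s}=\frac{\lambda_{1}^{t}}{1-\lambda_{2}/\lambda_{1}}+\frac{\lambda_{2}^{t}}{1-\lambda_{1}/\lambda_{2}}.
\end{equation*}
We note this only as a sanity check, under the assumption that the kernel of two stacked scalar layers is such a discrete convolution. Whether this holds exactly depends on the time-indexing convention of \cref{linear deep ssm}.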

\paragraph{Numerical verification of \cref{theorem 4.2}}
We first conduct numerical experiments to verify our main theorem. Specifically, given a one-layer linear SSM of width $l(m-1)+1$, we apply the construction from \cref{theorem 4.2} to reconstruct an equivalent model with $l$ layers and width $m$. We adopt a teacher--student setup, in which a wide, shallow network is learned by a deep linear network as defined in \cref{linear deep ssm}. The experiments are performed for both real-valued and complex-valued parameters. In \cref{fig: teacher-student}, the dots represent the maximum observed norms, while the lines depict the theoretical bounds. Overall, the experimental results confirm that the construction in \cref{theorem 4.2} is correct. In particular, the maximum norms decrease at the rate predicted by \cref{theorem 4.2} as the number of layers increases.
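
As a purely illustrative instance of this width bookkeeping (the value $13$ is chosen here for illustration and is not a setting taken from the experiments), a one-layer width of $l(m-1)+1=13$ corresponds to $l(m-1)=12$, so the depth--width pairs with $m\geq 2$ are
\begin{equation*}
    (l,m)\in\{(1,13),\,(2,7),\,(3,5),\,(4,4),\,(6,3),\,(12,2)\}.
\end{equation*}
Each pair describes a model with $l$ layers of width $m$ that is matched to the same one-layer model of width $13$.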

\begin{figure}[!ht]
  \centering
  \includegraphics[width=0.5\textwidth]{figs/max_norm.pdf}
  \caption{Teacher--student experiment between a one-layer linear network and a deep linear network: maximum norm of the parameters of the learned model as a function of the number of layers.}
  \label{fig: teacher-student}
\end{figure}


\paragraph{Experiments on linear functionals}
We next focus on the task of learning a linear functional impulse \citep{jorda2005estimation}, which captures long-range dependencies and is known to be challenging for recurrent neural networks (RNNs) \citep{pascanu2013difficulty}. The memory function takes the form of an impulse
\begin{align}
    \rho(s, \alpha) = 
    \begin{cases}
        1 & \text{if } s =\alpha, \\
        0 & \text{otherwise}
    \end{cases}
\end{align}
Here, the parameter $\alpha$ controls the shift distance. This task is well suited to highlighting the role of depth in deep linear state-space models, since we can keep the parameter count equal while increasing depth. The architecture used in these experiments is identical to the one analyzed in our theoretical framework, i.e., deep linear SSMs. We use models of the same effective size, that is, the total width $l(m-1)+1$ is fixed. As shown in \cref{fig: impluse}, with expressivity held fixed in this way, the approximation error decreases as the number of layers increases. Each point on the additional line in the plot represents the norm of a one-layer linear SSM that has the same representational capacity as the corresponding deep linear SSM; we compute this norm using \cref{expansion}. Thus, increasing the number of layers while reducing the width, so that the total number of effective parameters stays fixed, leads to improved performance. However, this improvement comes with a trade-off: as the number of layers increases, computation becomes slower. Furthermore, as the depth of the model increases, the norm required by the equivalent one-layer SSM also increases. This demonstrates that depth contributes to improved model performance when the parameter count is fixed.
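
To make the target concrete, suppose that the linear functional acts by discrete-time convolution with the memory function; this form and the symbols $H_{t}$, $x$, and $\hat{\rho}$ are notation introduced here for illustration only. Then, for $\alpha\geq 0$, the impulse gives
\begin{equation*}
    H_{t}(x)=\sum_{s\geq 0}\rho(s,\alpha)\,x(t-s)=x(t-\alpha),
\end{equation*}
i.e., a pure delay of the input by $\alpha$ steps. In the notation of \cref{expansion}, and up to the indexing conventions of \cref{linear deep ssm}, a deep linear SSM produces a kernel that is a sum of geometric modes,
\begin{equation*}
    \hat{\rho}(t)=\sum_{\tilde{l}=1}^{l}\sum_{\tilde{m}=1}^{m}\xi_{\tilde{l},\tilde{m}}\,\lambda_{\tilde{l},\tilde{m}}^{t},
\end{equation*}
so learning the impulse amounts to choosing eigenvalues and coefficients such that $\hat{\rho}(t)$ is close to $1$ at $t=\alpha$ and close to $0$ elsewhere.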

\begin{figure}[!ht]
    \centering
    \begin{subfigure}[b]{0.45\linewidth}
        \centering
        \includegraphics[width=1\linewidth]{figs/impulse_plot.pdf}
    \end{subfigure}
    \begin{subfigure}[b]{0.42\linewidth}
        \centering
        \includegraphics[width=1\linewidth]{figs/impulse_plot_time_per_epoch.pdf}
    \end{subfigure}
\caption{The left panel shows the training error and the maximum norm of the equivalent one-layer SSM, computed by \cref{expansion}, when linear SSMs of varying depths, all designed to achieve the same expressivity, are used to learn a linear functional impulse. The right panel reports the per-epoch runtime of SSMs with different depths, normalized by the runtime of a one-layer linear SSM.}
\label{fig: impluse}
\end{figure}

\paragraph{Experiments with a nonlinear S4 model on MNIST} We aim to validate our theoretical findings on a practical model. To this end, we conduct experiments with the S4 model \citep{smith2022simplified} on the sequential MNIST task \citep{xiao2017fashion}. The architecture used in this experiment is
\begin{align}
     \text{SSM}_{l,m} \to \text{FFN}
\end{align}
where $\text{FFN}$ denotes a feedforward network. A detailed description of the architecture is given in \cref{appendx:S4D structure}. When $l=1$, this reduces to the standard S4D model \citep{gupta2022diagonal}. When $l > 1$, we reduce $d_{\text{state}}$ to keep the width fixed, thereby preserving the expressivity of the model. From \cref{fig: mnist}, we observe that performance improves when a wide one-layer SSM is replaced with a narrower multi-layer SSM, i.e., when the number of layers is increased while the hidden dimension is reduced to maintain the same expressivity. However, this improvement comes with the same trade-off as before: as the number of layers increases, computation becomes slower. Future work may explore more efficient methods for multi-layer computation, which could help balance the benefits of depth against computational cost.
\begin{figure}[!ht]
    \centering
    \begin{subfigure}[b]{0.45\linewidth}
        \centering
        \includegraphics[width=1\linewidth]{figs/mnist.pdf}
    \end{subfigure}
    \begin{subfigure}[b]{0.42\linewidth}
        \centering
        \includegraphics[width=1\linewidth]{figs/mnist_plot_time_per_epoch.pdf}
    \end{subfigure}
\caption{The left panel shows the accuracy and the maximum norm of the equivalent one-layer SSM, computed by \cref{expansion}, for S4 models of varying depths on MNIST, all designed to achieve the same expressivity. The right panel reports the per-epoch runtime of SSMs with different depths, normalized by the runtime of a one-layer SSM.}
\label{fig: mnist}
\end{figure}


% max norm of equivalent one layer 
\section{Conclusion}
\label{conclusion}
In this work, we investigate the effect of depth on the expressivity of deep linear state-space models under norm constraints. Our theoretical analysis in \cref{theorem 4.1} reveals a nontrivial dependence of model expressivity on depth when the total parameter count is held constant. In particular, while increasing depth and increasing width are generally equivalent in the absence of norm constraints, their roles differ significantly in norm-bounded regimes. In \cref{theorem 4.2}, we also show that increasing depth can significantly reduce the required parameter norms under the same parameter count, especially in tasks that demand large-norm representations, such as modeling oscillatory or non-smooth memory functions \citep{pascanu2013difficulty}. This norm reduction highlights the effectiveness of deeper architectures, as our experiments in \cref{experiment} show. Moreover, our results suggest a promising direction for future research: factoring shallow SSMs into deeper ones may enhance expressivity and generalization, even in classical implementations. While this may incur a runtime cost due to the increased depth, it opens the door to better optimization and stability, especially when norm control is critical. Understanding this trade-off more thoroughly in nonlinear or real-world SSM variants remains an exciting avenue for future exploration.

The current work has some important limitations.
\begin{itemize}
    \item We focus on the linear setting, although we expect the results to generalize to the nonlinear case. A key limitation is that the convolutional representation of nonlinear models remains unaddressed. Nonetheless, our experiments on a nonlinear model suggest that the theoretical insights may carry over.
    \item We analyze the role of depth from an approximation perspective. Under implicit or explicit regularization dynamics, norm constraints arise naturally, leading to a fundamentally different setting that we do not analyze here.
\end{itemize}
