\documentclass{article}


% if you need to pass options to natbib, use, e.g.:
%     \PassOptionsToPackage{numbers, compress}{natbib}
% before loading neurips_2025


% preprint version (e.g., for arXiv)
\usepackage[preprint]{neurips_2025}
\usepackage{subcaption}
% to compile a preprint version, e.g., for submission to arXiv, add the
% [preprint] option:
%     \usepackage[preprint]{neurips_2025}


% to compile a camera-ready version, add the [final] option, e.g.:
%     \usepackage[final]{neurips_2025}


% to avoid loading the natbib package, add option nonatbib:
%    \usepackage[nonatbib]{neurips_2025}


\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{amsmath}
\usepackage{graphicx}
% \usepackage{cleveref}

\usepackage{amsthm}
\usepackage{amssymb}
\usepackage[capitalize,noabbrev]{cleveref}

\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}

\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}

\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

\newcommand{\jht}[1]{{\color{cyan}JHT: #1}}

% Explicitly set cleveref names
\crefname{theorem}{Theorem}{Theorems}
\crefname{proposition}{Proposition}{Propositions}
\crefname{lemma}{Lemma}{Lemmas}
\crefname{corollary}{Corollary}{Corollaries}
\crefname{definition}{Definition}{Definitions}
\crefname{assumption}{Assumption}{Assumptions}
\crefname{remark}{Remark}{Remarks}
\title{The Effect of Depth on the Expressivity of Deep Linear State-Space Models}

       
% The \author macro works with any number of authors. There are two commands
% used to separate the names and addresses of multiple authors: \And and \AND.
%
% Using \And between authors leaves it to LaTeX to determine where to break the
% lines. Using \AND forces a line break at that point. So, if LaTeX puts 3 of 4
% authors names on the first line, and the last on the second line, try using
% \AND instead of \And before the third author name.


\author{%
  Zeyu Bao \\
  Department of Mathematics \\
  National University of Singapore \\
  \texttt{zeyu@u.nus.edu} \\
  \And
  Penghao Yu \\
  Department of Mathematics \\
  National University of Singapore \\
  \texttt{e1353366@u.nus.edu} \\
  \And
  Haotian Jiang \\
  Department of Mathematics \\
  Institute for Functional Intelligent Materials \\
  National University of Singapore \\
  \texttt{haotian@nus.edu.sg} \\
  \And
  Qianxiao Li \\
  Department of Mathematics \\
  Institute for Functional Intelligent Materials \\
  National University of Singapore \\
  \texttt{qianxiao@nus.edu.sg} \\
}



\begin{document}


\maketitle


\begin{abstract}


Deep state-space models (SSMs) have gained increasing popularity in sequence modelling. While there are numerous theoretical investigations of shallow SSMs, how depth affects the expressiveness of SSMs remains a crucial problem. In this paper, we systematically investigate the role of depth and width in deep linear SSMs, aiming to characterize how they influence the expressive capacity of the architecture. First, we rigorously prove that in the absence of parameter constraints, increasing depth and increasing width are generally equivalent, provided that the parameter count remains within the same order of magnitude. However, when the parameter norms are constrained, the effects of depth and width differ significantly. We show by an explicit construction that a shallow linear SSM with large parameter norms can be represented by a deep linear SSM with smaller norms. In particular, this demonstrates that, under norm constraints, deep SSMs are more capable than shallow SSMs of representing targets with large norms. Finally, we derive upper bounds on the minimal depth required for a deep linear SSM to represent a given shallow linear SSM under constrained parameter norms. We also validate our theoretical results with numerical experiments.

\end{abstract}

\input{1_introduction}
\input{2_theory}
\input{3_experiment}

% \newpage


\bibliography{ref}
\bibliographystyle{ref}

\newpage


% \iftrue % or \iffalse to exclude
\appendix
\input{4_appendix}

\include{5_checklist}

\end{document}
