Architecture-Aware Generalization Bounds for Temporal Networks: Theory and Fair Comparison Methodology

TMLR Paper6547 Authors

18 Nov 2025 (modified: 21 Nov 2025) · Under review for TMLR · CC BY 4.0
Abstract: Deep temporal architectures such as Temporal Convolutional Networks (TCNs) achieve strong predictive performance on sequential data, yet theoretical understanding of their generalization remains limited. We address this gap through three contributions: a principled evaluation methodology for temporal models, surprising empirical phenomena concerning temporal dependence, and the first architecture-aware theoretical framework for dependent sequences.

\textbf{Fair-Comparison Methodology.} We introduce evaluation protocols that fix the effective sample size $N_{\text{eff}}$ to isolate the effects of temporal structure from those of information content. This addresses a fundamental challenge: temporal dependence affects both information content and learning dynamics, and standard evaluations conflate the two. Our methodology enables principled comparison of models across dependency regimes.

\textbf{Empirical Findings.} Applying this methodology reveals that, with $N_{\text{eff}} = 2{,}000$ held fixed, strongly dependent sequences ($\rho = 0.8$) exhibit approximately $76\%$ smaller generalization gaps than weakly dependent ones ($\rho = 0.2$), challenging the conventional view that dependence universally impedes learning. Moreover, the observed convergence rates ($N_{\text{eff}}^{-1.21}$ to $N_{\text{eff}}^{-0.89}$) are substantially faster than the theoretical worst-case rate ($N^{-0.5}$), indicating that temporal architectures exploit problem structure in ways current theory does not capture.

\textbf{Theoretical Framework.} To ground these empirical investigations, we develop the first architecture-aware generalization bounds for deep temporal models on exponentially $\beta$-mixing sequences. By embedding Golowich et al.'s i.i.d. class bound within a novel blocking scheme that partitions $N$ samples into $B \approx N/\log N$ quasi-independent blocks, we establish polynomial sample complexity under convex Lipschitz losses. The framework achieves $\sqrt{D}$ depth scaling alongside the product of layer-wise norms $R = \prod_{\ell=1}^{D} M^{(\ell)}$, avoiding exponential depth dependence. While these bounds are conservative, as our empirical results demonstrate, they prove learnability and identify architectural scaling laws, providing worst-case baselines that highlight where future theory must improve to explain observed performance.
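For illustration, a minimal Python sketch of the quantities named in the abstract: an effective sample size under AR(1)-style dependence (one common variance-of-the-mean convention; the paper's exact definition of $N_{\text{eff}}$ may differ), the blocking into $B \approx N/\log N$ quasi-independent blocks, and the depth-aware capacity factor $\sqrt{D}\,\prod_{\ell} M^{(\ell)}$. All function names and constants here are illustrative assumptions, not the authors' implementation or the bound's exact constants.

```python
import numpy as np

def effective_sample_size(n, rho):
    """Effective sample size of an AR(1)-style series with lag-1 correlation rho,
    using the standard convention N_eff = N (1 - rho) / (1 + rho).
    (Illustrative stand-in; the paper may define N_eff differently.)"""
    return n * (1.0 - rho) / (1.0 + rho)

def blocking(n):
    """Blocking scheme: partition n samples into B ~ n / log n blocks of length
    ~ log n, which exponential beta-mixing renders quasi-independent."""
    block_len = max(1, int(np.ceil(np.log(n))))
    return n // block_len, block_len

def capacity_factor(layer_norms):
    """Depth-aware capacity term sqrt(D) * prod_l M^(l) from the Golowich-style
    class bound (D layers with norms M^(1), ..., M^(D))."""
    D = len(layer_norms)
    return np.sqrt(D) * float(np.prod(layer_norms))

def worst_case_bound(n, layer_norms, lipschitz=1.0):
    """Illustrative worst-case rate L * sqrt(D) * R / sqrt(B), i.e. order
    sqrt(log n / n) up to constants (mixing coefficients omitted)."""
    num_blocks, _ = blocking(n)
    return lipschitz * capacity_factor(layer_norms) / np.sqrt(num_blocks)

if __name__ == "__main__":
    # Raw series lengths chosen so N_eff is roughly 2,000 in both regimes.
    for rho in (0.2, 0.8):
        n_raw = int(round(2000 * (1 + rho) / (1 - rho)))
        print(f"rho={rho}: N={n_raw}, N_eff~{effective_sample_size(n_raw, rho):.0f}")
    print(f"bound ~ {worst_case_bound(18000, layer_norms=[1.5] * 6):.3f}")
```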
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=7GZ0TcV691
Changes Since Last Submission: In brief, the main updates are:
\begin{itemize}
\item \textbf{Corrected theoretical results.} Lemma~2 now explicitly follows the architecture-aware Rademacher bound of Golowich et al., with the full product of layer spectral norms \(R = \prod_{\ell=1}^D M^{(\ell)}\) and \(\sqrt{D}\) depth dependence. The main theorem has been restated to clearly separate the imported i.i.d.\ class bound from our new mixing-based framework for \(\beta\)-mixing sequences.
\item \textbf{Online-to-batch and convexity.} Proposition~1 has been revised to make the convex Lipschitz loss assumption explicit and to clearly invoke Jensen's inequality in the online-to-batch step (see the sketch after this list). The resulting generalization bound depends on the number of blocks \(B\), yielding a concentration rate of order \(\sqrt{\log N / N}\) under exponentially decaying \(\beta\)-mixing.
\item \textbf{Scope and positioning.} Throughout the paper we now emphasize that we do \emph{not} extend or modify the core theorem of Golowich et al. Instead, we extend the \emph{applicability} of such architecture-aware bounds from i.i.d.\ samples to \(\beta\)-mixing time series by embedding their class bound in a blocking and delayed-feedback framework.
\item \textbf{Experiments and presentation.} We clarified the ``fair comparison'' protocol, softened statements that might over-interpret the bounds (e.g., depth vs.\ data requirements), and streamlined the exposition and figures to better align the empirical results with the revised theory.
\end{itemize}
A detailed, point-by-point response to all reviewer and Associate Editor comments is provided in the accompanying rebuttal. We believe these changes fully address the concerns raised in the previous round and significantly strengthen the manuscript. Thank you very much for your time and consideration.
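To make the online-to-batch step in the second bullet concrete, a minimal sketch assuming a loss that is convex in the prediction; `averaged_predictor`, `batch_risk_bound`, and the numeric values are hypothetical placeholders, not Proposition~1 itself or its constants.

```python
import numpy as np

def averaged_predictor(block_predictors):
    """Online-to-batch step: for a loss convex in the prediction, Jensen's
    inequality gives loss(mean_b f_b(x), y) <= (1/B) sum_b loss(f_b(x), y),
    so the averaged predictor's risk is controlled by the average online loss."""
    def f_bar(x):
        return np.mean([f(x) for f in block_predictors], axis=0)
    return f_bar

def batch_risk_bound(avg_online_loss, avg_regret, concentration):
    """Illustrative assembly of the bound: average online loss over the B blocks,
    plus per-block regret, plus a concentration term of order sqrt(log N / N)
    (constants and beta-mixing coefficients omitted)."""
    return avg_online_loss + avg_regret + concentration

if __name__ == "__main__":
    # Toy usage with three hypothetical per-block linear predictors.
    preds = [lambda x, a=a: a * x for a in (0.9, 1.0, 1.1)]
    f_bar = averaged_predictor(preds)
    print(f_bar(np.array([1.0, 2.0])))  # -> [1. 2.]

    # With B ~ N / log N blocks, the concentration term scales as 1 / sqrt(B).
    N = 18000
    B = N // int(np.ceil(np.log(N)))
    print(batch_risk_bound(avg_online_loss=0.31, avg_regret=0.02,
                           concentration=1.0 / np.sqrt(B)))
```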
Assigned Action Editor: ~Akshay_Rangamani1
Submission Number: 6547