Architecture-Aware Generalization Bounds for Temporal Networks: Theory and Fair Comparison Methodology

TMLR Paper 6547 Authors

18 Nov 2025 (modified: 03 Feb 2026). Under review for TMLR. License: CC BY 4.0.
Abstract: Learning from time series is fundamentally different from learning from i.i.d.\ data: temporal dependence can make long sequences effectively information-poor, yet standard evaluation protocols conflate sequence length with statistical information. We propose a dependence-aware evaluation methodology that controls for the effective sample size $N_{\text{eff}}$ rather than the raw length $N$, and provide end-to-end generalization guarantees for Temporal Convolutional Networks (TCNs) on $\beta$-mixing sequences. Our analysis combines a blocking/coupling reduction that extracts $B = \Theta(N/\log N)$ approximately independent anchors with an architecture-aware Rademacher bound for $\ell_{2,1}$-norm-controlled convolutional networks, yielding $O(\sqrt{D\log p / B})$ complexity scaling in depth $D$ and kernel size $p$. Empirically, we find that stronger temporal dependence can \emph{reduce} generalization gaps when comparisons control for $N_{\text{eff}}$, a conclusion that reverses under standard fixed-$N$ evaluation; the observed rates of $N_{\text{eff}}^{-0.9}$ to $N_{\text{eff}}^{-1.2}$ are substantially faster than the worst-case $O(N^{-1/2})$ mixing-based prediction. Our results suggest that dependence-aware evaluation should become standard practice in temporal deep learning benchmarks.
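As a reading aid only, the following is a minimal sketch (not the authors' code or protocol) of the matched-$N_{\text{eff}}$ comparison the abstract describes: it assumes exponentially decaying $\beta$-mixing so that the blocking length scales like $\log N$, and fits the empirical power-law exponent of the generalization gap against $N_{\text{eff}}$. The `mixing_time` rescaling and the synthetic gap values are illustrative placeholders.

```python
# Sketch of dependence-aware evaluation: compare gaps at matched effective
# sample size and estimate the decay exponent alpha in gap ~ N_eff^(-alpha).
import numpy as np

def effective_sample_size(n, mixing_time=1.0):
    """Approximate block count B = Theta(N / log N) for exponentially
    beta-mixing data; `mixing_time` is a stand-in for the unknown decay rate."""
    block_len = max(1, int(np.ceil(mixing_time * np.log(n))))
    return n // block_len

def fit_rate(n_eff, gaps):
    """Fit gap ~ C * N_eff^(-alpha) by least squares in log-log space."""
    slope, _ = np.polyfit(np.log(n_eff), np.log(gaps), 1)
    return -slope  # estimated alpha

# Toy usage with synthetic gaps decaying like N_eff^(-1.0):
lengths = np.array([2_000, 8_000, 32_000, 128_000])
n_eff = np.array([effective_sample_size(n) for n in lengths])
gaps = 5.0 / n_eff  # placeholder for measured train/test gaps
print(n_eff, fit_rate(n_eff, gaps))  # alpha close to 1.0
```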
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=7GZ0TcV691
Changes Since Last Submission: In brief, the main updates are:
\begin{itemize}
\item \textbf{Corrected theoretical results.} Lemma~2 now explicitly follows the architecture-aware Rademacher bound of Golowich et al., with the full product of layer spectral norms \(R = \prod_{\ell=1}^D M^{(\ell)}\) and \(\sqrt{D}\) depth dependence. The main theorem has been restated to clearly separate the imported i.i.d.\ class bound from our new mixing-based framework for \(\beta\)-mixing sequences.
\item \textbf{Online-to-batch and convexity.} Proposition~1 has been revised to make the convex Lipschitz loss assumption explicit and to clearly invoke Jensen's inequality in the online-to-batch step. The resulting generalization bound depends on the number of blocks \(B\), yielding a concentration rate of order \(\sqrt{\log N / N}\) under exponentially decaying \(\beta\)-mixing (see the schematic identity below).
\item \textbf{Scope and positioning.} Throughout the paper we now emphasize that we do \emph{not} extend or modify the core theorem of Golowich et al. Instead, we extend the \emph{applicability} of such architecture-aware bounds from i.i.d.\ samples to \(\beta\)-mixing time series by embedding their class bound in a blocking and delayed-feedback framework.
\item \textbf{Experiments and presentation.} We clarified the ``fair comparison'' protocol, softened statements that might over-interpret the bounds (e.g., depth vs.\ data requirements), and streamlined the exposition and figures to better align the empirical results with the revised theory.
\end{itemize}
A detailed, point-by-point response to all reviewer and Associate Editor comments is provided in the accompanying rebuttal. We believe these changes fully address the concerns raised in the previous round and significantly strengthen the manuscript. Thank you very much for your time and consideration.
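Schematic identity for the concentration-rate claim in the second bullet above (constants, the Lipschitz factor, and the \(\beta\)-mixing residual term are suppressed): with \(B = \Theta(N/\log N)\) near-independent blocks, a \(1/\sqrt{B}\) rate is of order \(\sqrt{\log N / N}\), so the architecture-aware class term scales as
\[
O\!\left(R\sqrt{\frac{D\log p}{B}}\right) \;=\; O\!\left(R\sqrt{\frac{D\log p\,\log N}{N}}\right), \qquad B = \Theta\!\left(\frac{N}{\log N}\right).
\]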
Assigned Action Editor: ~Akshay_Rangamani1
Submission Number: 6547