Grassmannian Optimization Drives Generalization in Overparameterized DNNs

Published: 22 Sept 2025, Last Modified: 01 Dec 2025 | NeurIPS 2025 Workshop | CC BY 4.0
Keywords: SGD, GD, optimization algorithms, generalization gap, overparameterized, DNN, overfitting, generalization theory, statistical learning theory, fiber bundle
TL;DR: A rigorous geometric theory of generalization in overparameterized DNNs. Resolves the decade-old open problem of generalization-gap estimation. A unified mathematical framework for moving deep learning from alchemy to principled design.
Abstract: We present an overview of a geometric theory explaining why and how heavily overparameterized deep neural networks generalize despite being able to perfectly fit random labels. The key insight is that, contrary to the uniform hypothesis-class assumptions of classical statistical learning theory, deep learning admits an iso-loss–induced \emph{fiber bundle structure} shaped jointly by the loss function, hypothesis class, and data distribution. Gradient-based optimization follows horizontal lifts across low-dimensional subspaces in the Grassmannian $\mathrm{Gr}(r,p)$, where $r \ll p$ is the rank of the Hessian at the optimum. The low-dimensional subspace is selected by random initialization near the origin and shaped by the data and the local trivialization structure. This yields: (i) a mechanistic explanation---effective complexity is $r$, not the ambient dimension $p$, because the $(p-r)$-dimensional fibers $F=\ker(H)$ are statistically inert; (ii) a unifying geometric framework for flat minima, PAC-Bayes, NTK, double descent, and implicit algorithmic regularization; and (iii) a closed-form finite-sample generalization gap equation together with a bias–variance decomposition (a theoretical scaling law) of the generalization error. Empirical evaluations of the gap equation achieve $>90\%$ predictive accuracy, improving upon VC, PAC-Bayesian, and spectral bounds by orders of magnitude. The equation resolves the long-standing open problem of explaining generalization in gradient-trained overparameterized DNNs~\citep{kawaguchi2023gdl,zhang2021understanding}. The degeneracy of the Hessian post-training is thus a hallmark of generalization, rather than an empirical curiosity. The framework provides a practical path for transforming current trial-and-error deep learning practice—especially for large models—into principled design and engineering. Practical translation to large-scale models requires computational innovations that we identify as key collaborative directions.\footnote{This paper presents Part~1 of a broader framework. Part~2 (Optimization Dynamics) develops how hyperparameters map to generalization through the same geometry. Complete proofs appear in the full version~\citep{CFWang2025}.}
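As a deliberately simple illustration of the rank-versus-ambient-dimension distinction sketched in the abstract (and not the paper's actual construction), the following NumPy sketch uses an overparameterized linear least-squares model: the Hessian at a minimizer has rank $r \le n \ll p$, and the training loss is flat along the $(p-r)$-dimensional kernel directions, a toy analogue of the "statistically inert" fibers $F=\ker(H)$.

```python
# Toy illustration only: overparameterized linear least squares.
# The Hessian of 0.5 * mean((Xw - y)^2) is H = X^T X / n, whose rank is
# at most n << p; directions in ker(H) leave the training loss unchanged.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200                                   # n samples, p parameters (p >> n)
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

w_star, *_ = np.linalg.lstsq(X, y, rcond=None)   # a minimizer of the loss
H = X.T @ X / n                                  # Hessian at (any) minimizer

r = np.linalg.matrix_rank(H)
print(f"ambient dim p = {p}, Hessian rank r = {r}, fiber dim p - r = {p - r}")

# Pick a direction in ker(H) = ker(X) and check the loss is flat along it.
_, _, Vt = np.linalg.svd(X, full_matrices=True)
v = Vt[-1]                                       # a null-space direction of X
loss = lambda w: 0.5 * np.mean((X @ w - y) ** 2)
print(loss(w_star), loss(w_star + 10.0 * v))     # equal up to floating-point error
```

In this toy setting the "effective complexity" is governed by the Hessian rank $r$ rather than the parameter count $p$; the paper's fiber-bundle framework develops the corresponding statement for gradient-trained deep networks.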
Submission Number: 90