The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning

TMLR Paper8835 Authors

08 May 2026 (modified: 23 May 2026) | Under review for TMLR | CC BY 4.0

Abstract: In training a neural network with gradient descent (GD), each iteration induces a linear operator governing the first-order updates to a model's internal state variables. We define this operator as the global empirical neural tangent kernel, $\text{NTK}_S$. In finite-width networks, $\text{NTK}_S$ is typically intractable to form, leading prior work to focus on more restrictive settings, such as tracking outputs only or taking infinite-width limits. Here, we bridge this gap by studying the structure of $\text{NTK}_S$ for a broad class of models. Formulating the model state as the solution to a single global implicit constraint, we derive $\text{NTK}_S$ as a product of two operators: $\mathcal{K}$, accounting for immediate parameter-to-state interactions, and $\mathcal{P}$, describing internal state-to-state dependencies. For a broad class of weight-based models, including RNNs and transformers, we prove a universal Kronecker-core theorem showing that $\mathcal{K}$ admits an exact forward-pass-computable form given by the Gram matrix of weight-site variables. This core structure reveals that $\text{NTK}_S$ is structurally bottlenecked, constraining its effective rank and giving rise to a \textit{self-referential bias}, whereby GD preferentially learns within dominant modes of joint hidden-input activity. For recurrent models, including GRUs and RNNs, we examine the spectrum of $\text{NTK}_S$ and show when it is biased and low-rank in space or time under the proposed decomposition. We further demonstrate that the structure of the model dynamics at initialization biases $\text{NTK}_S$, restricting learning and preventing certain task components from being learned effectively. Finally, to demonstrate broader applicability, we show that the $\text{NTK}_S$ associated with a self-attention transformer is likewise structurally constrained to be low-rank. 
Overall, we show that $\text{NTK}_S$ possesses tractable structure that explains GD bias toward particular task solutions and the typical emergence of low-rank representations. To further enable the use of $\text{NTK}_S$ as a practical metric, we build a library, \texttt{kpflow}, based on randomized matrix-free numerical linear algebra.
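
As a minimal illustration of the Gram-matrix structure described in the abstract, consider the simplest possible weight site, a single linear map $h_i = W z_i$. This is a toy sketch under stated assumptions, not the paper's theorem and not the \texttt{kpflow} API: the function names `linear_site_kernel` and `gram_kronecker` are hypothetical, and the example covers only the immediate parameter-to-state part for one layer. In this case the kernel $J J^\top$ over all samples equals $G \otimes I$, where $G = Z Z^\top$ is the Gram matrix of the inputs to the weight, so its rank is bounded by the input dimension times the output dimension regardless of how many samples are used.

```python
import numpy as np


def linear_site_kernel(Z, n_out):
    """Explicit J J^T for the toy weight site h_i = W z_i.

    Z has shape (n_samples, n_in). Parameters are vec(W) in row-major order.
    This builds the full Jacobian, which is only feasible for tiny problems.
    """
    n_samples, _ = Z.shape
    # Jacobian of h_i w.r.t. row-major vec(W) is I_{n_out} kron z_i^T.
    blocks = [np.kron(np.eye(n_out), Z[i][None, :]) for i in range(n_samples)]
    J = np.concatenate(blocks, axis=0)  # (n_samples * n_out, n_out * n_in)
    return J @ J.T


def gram_kronecker(Z, n_out):
    """Same kernel from forward-pass quantities only: (Z Z^T) kron I."""
    G = Z @ Z.T  # Gram matrix of weight-site inputs
    return np.kron(G, np.eye(n_out))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_samples, n_in, n_out = 20, 3, 4  # illustrative sizes (assumed)
    Z = rng.standard_normal((n_samples, n_in))

    K_explicit = linear_site_kernel(Z, n_out)
    K_gram = gram_kronecker(Z, n_out)

    print("match:", np.allclose(K_explicit, K_gram))
    print("kernel size:", K_explicit.shape[0])
    print("rank:", np.linalg.matrix_rank(K_explicit))
```

The rank bound here follows from $\operatorname{rank}(G \otimes I) = \operatorname{rank}(G) \cdot n_{\text{out}}$ with $\operatorname{rank}(G) \le n_{\text{in}}$; it is meant only to convey why a Gram-matrix core can bottleneck the effective rank, and the general statement for RNNs and transformers is the one given in the paper.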

Submission Type: Long submission (more than 12 pages of main content)

Assigned Action Editor: ~Antonio_Orvieto3

Submission Number: 8835