Abstract: In this work, we develop a norm-based generalization bound for a shallow Transformer model trained with gradient descent. The bound is obtained in three major steps: (a) defining a class of Transformer models whose weights stay close to their initialization throughout training; (b) upper bounding the Rademacher complexity of this class; and (c) upper bounding the empirical loss of every Transformer model in this class at every training step. Together, these yield an upper bound on the true loss that tightens sublinearly as the number of training examples $N$ grows, for all values of the model dimension $d_m$. We also perform experiments on the MNIST dataset to support our theoretical findings.
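For context (a standard template rather than the paper's exact statement), norm-based Rademacher-complexity arguments typically combine steps (a)-(c) into a bound of the form

$$\mathbb{E}[\ell(f)] \;\le\; \widehat{\mathbb{E}}_S[\ell(f)] \;+\; 2\,\mathfrak{R}_N(\ell \circ \mathcal{F}) \;+\; 3\sqrt{\tfrac{\log(2/\delta)}{2N}} \qquad \text{for all } f \in \mathcal{F},$$

which holds with probability at least $1-\delta$ for a loss bounded in $[0,1]$, where $\mathcal{F}$ is the restricted model class and $\mathfrak{R}_N$ its Rademacher complexity. When a norm-based estimate of the form $\mathfrak{R}_N(\ell \circ \mathcal{F}) = O(B/\sqrt{N})$ is available for a weight-norm bound $B$, the resulting bound tightens sublinearly in $N$, as described above. (The constants shown here are generic and illustrative, not those derived in the paper.)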
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yingbin_Liang1
Submission Number: 5721