Abstract: In this work, we establish a \textbf{norm-based generalization bound} for a \textit{shallow Transformer model} trained via gradient descent under the \textit{bounded-drift (lazy training)} regime, where model parameters remain close to their initialization throughout training. Our analysis proceeds in three stages: (a) we formally define a hypothesis class of Transformer models constrained to remain within a small neighborhood of their initialization; (b) we derive an \textbf{upper bound on the Rademacher complexity} of this class, quantifying its effective capacity; and (c) we establish an \textbf{upper bound on the empirical loss} achieved by gradient descent under suitable assumptions on model width, learning rate, and data structure. Combining these results, we obtain a \textbf{high-probability bound on the true loss} that decays sublinearly with the number of training samples $N$ and depends explicitly on model and data parameters. The resulting bound demonstrates that, in the lazy regime, wide and shallow Transformers generalize similarly to their linearized (NTK) counterparts. Empirical evaluations on both text and image datasets support the theoretical findings.
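For context, the combination of steps (b) and (c) described above follows the standard Rademacher-complexity template for generalization; a schematic form (assuming a loss bounded in $[0,1]$, with $L$ and $\widehat{L}_S$ denoting the true and empirical loss and $\delta$ a confidence parameter, none of which are the paper's exact constants) is: with probability at least $1-\delta$ over the sample $S$ of size $N$, for all $f$ in the constrained class $\mathcal{F}$,
\[
L(f) \;\le\; \widehat{L}_S(f) \;+\; 2\,\mathcal{R}_S(\mathcal{F}) \;+\; 3\sqrt{\frac{\log(2/\delta)}{2N}}.
\]
The paper's bound instantiates $\mathcal{R}_S(\mathcal{F})$ and the empirical term with Transformer- and data-specific quantities under the lazy-training assumptions.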
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have added a discussion (Section 4.5) and dedicated paragraphs comparing our results with existing norm-based bounds, as well as with PAC-Bayes and stability-based generalization frameworks. These additions provide clearer context for the novelty and distinct analytical perspective of our work.
We have added experiments on a text dataset.
We have revised the proofs so that they rely on $\mathcal{R}_S(\mathcal{F}) \leq \mathcal{R}_S(\mathcal{G})$ rather than on the equality $\mathcal{R}_S(\mathcal{F}) = \mathcal{R}_S(\mathcal{G})$.
We have corrected the exponent of $\rho$ in Lemma 4 from $\rho^4$ to $\rho^2$.
Assigned Action Editor: ~Yingbin_Liang1
Submission Number: 5721