Generalization Bound for a Shallow Transformer Trained Using Gradient Descent

Published: 15 Jan 2026, Last Modified: 15 Jan 2026. Accepted by TMLR. License: CC BY 4.0
Abstract: In this work, we establish a norm-based generalization bound for a shallow Transformer model trained via gradient descent under the bounded-drift (lazy training) regime, where model parameters remain close to their initialization throughout training. Our analysis proceeds in three stages: (a) we formally define a hypothesis class of Transformer models constrained to remain within a small neighborhood of their initialization; (b) we derive an upper bound on the Rademacher complexity of this class, quantifying its effective capacity; and (c) we establish an upper bound on the empirical loss achieved by gradient descent under suitable assumptions on model width, learning rate, and data structure. Combining these results, we obtain a high-probability bound on the true loss that decays sublinearly with the number of training samples $N$ and depends explicitly on model and data parameters. The resulting bound demonstrates that, in the lazy regime, wide and shallow Transformers generalize similarly to their linearized (NTK) counterparts. Empirical evaluations on both text and image datasets support the theoretical findings.
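The bounded-drift (lazy training) condition described in the abstract can be illustrated with a small experiment. The sketch below is not the paper's exact architecture or training setup; it is a minimal, assumed example (a single self-attention head with an MLP readout, full-batch gradient descent on synthetic data, and illustrative hyperparameters such as `d_model=256` and `lr=0.01`) that tracks the relative parameter drift $\|\theta_t - \theta_0\| / \|\theta_0\|$, the quantity that the lazy regime requires to stay small.

```python
# Minimal sketch (assumed setup, not the paper's): train a one-layer
# Transformer block with full-batch gradient descent on synthetic data and
# monitor how far the parameters drift from their initialization.
import torch
import torch.nn as nn

torch.manual_seed(0)


class ShallowTransformer(nn.Module):
    """One self-attention head followed by a two-layer MLP readout."""

    def __init__(self, d_model=256, d_ff=1024):
        super().__init__()
        self.Wq = nn.Linear(d_model, d_model, bias=False)
        self.Wk = nn.Linear(d_model, d_model, bias=False)
        self.Wv = nn.Linear(d_model, d_model, bias=False)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, 1))
        self.scale = d_model ** -0.5

    def forward(self, x):                                # x: (batch, seq, d_model)
        q, k, v = self.Wq(x), self.Wk(x), self.Wv(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        h = attn @ v                                     # (batch, seq, d_model)
        return self.ff(h.mean(dim=1)).squeeze(-1)        # pooled scalar output


def param_vector(model):
    # Flatten all parameters into one vector so drift norms are easy to compute.
    return torch.cat([p.detach().flatten() for p in model.parameters()])


# Synthetic regression data (placeholder for the paper's text/image tasks).
N, seq_len, d_model = 128, 16, 256
X = torch.randn(N, seq_len, d_model)
y = torch.randn(N)

model = ShallowTransformer(d_model=d_model)
theta0 = param_vector(model)
opt = torch.optim.SGD(model.parameters(), lr=0.01)       # plain gradient descent
loss_fn = nn.MSELoss()

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    if step % 50 == 0:
        drift = (param_vector(model) - theta0).norm() / theta0.norm()
        print(f"step {step:3d}  loss {loss.item():.4f}  relative drift {drift:.4f}")
```

Under the lazy-training picture, the printed relative drift should remain small (and shrink as the width `d_ff`/`d_model` grows), which is the regime in which the paper's norm-based Rademacher complexity and generalization bounds apply.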
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have added a "Limitations" section and expanded the "Related Work" section, as requested.
Assigned Action Editor: ~Yingbin_Liang1
Submission Number: 5721