Abstract: In this work, we develop a norm-based generalization bound for a shallow Transformer model trained with gradient descent. The bound is obtained in three major steps: (a) defining a class of Transformer models whose weights stay close to their initialization throughout training; (b) upper bounding the Rademacher complexity of this class; and (c) upper bounding the empirical loss of every Transformer model in this class at every training step. Together, these yield an upper bound on the true loss that tightens sublinearly as the number of training examples $N$ grows, for all values of the model dimension $d_m$. We also perform experiments on the MNIST dataset to support our theoretical findings.
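For context (a standard template rather than the paper's exact statement), norm-based Rademacher-complexity arguments typically combine steps (a)-(c) into a bound of the form

$$\mathbb{E}[\ell(f)] \;\le\; \widehat{\mathbb{E}}_S[\ell(f)] \;+\; 2\,\mathfrak{R}_N(\ell \circ \mathcal{F}) \;+\; 3\sqrt{\tfrac{\log(2/\delta)}{2N}} \qquad \text{for all } f \in \mathcal{F},$$

which holds with probability at least $1-\delta$ for a loss bounded in $[0,1]$, where $\mathcal{F}$ is the restricted model class and $\mathfrak{R}_N$ its Rademacher complexity. When a norm-based estimate of the form $\mathfrak{R}_N(\ell \circ \mathcal{F}) = O(B/\sqrt{N})$ is available for a weight-norm bound $B$, the resulting bound tightens sublinearly in $N$, as described above. (The constants shown here are generic and illustrative, not those derived in the paper.)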
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yingbin_Liang1
Submission Number: 5721