Keywords: training dynamics, optimization, transformer, multi-head softmax attention, in-context learning
Abstract: This paper investigates the training dynamics of transformers under gradient descent through the lens of non-linear regression tasks. Contextual generalization is attained via in-context learning of the template function for each task, where all template functions lie in a linear space with $m$ basis functions. We analyze the training dynamics of multi-head transformers that predict unlabeled inputs in context given partially labeled prompts, where the labels contain Gaussian noise and each prompt may contain only a few examples, too few to determine the template. We show that the training loss of a shallow multi-head transformer converges linearly to a global minimum. Moreover, the transformer effectively learns to perform ridge regression. To our knowledge, this study is the first to show that transformers can learn contextual (i.e., template) information to generalize to unseen examples when prompts contain only a small number of query-answer pairs.
Submission Number: 23