Keywords: in-context learning, transformer, deep learning theory, generalization analysis
TL;DR: We study the optimization and generalization of a single-head, one-layer Transformer for in-context learning on classification tasks.
Abstract: Transformer-based large language models have displayed impressive capabilities in the domain of in-context learning, wherein they use multiple input-output pairs to make predictions on unlabeled test data. To lay the theoretical groundwork for in-context learning, we delve into the optimization and generalization of a single-head, one-layer Transformer in the context of multi-task learning for classification. Our investigation uncovers that lower sample complexity is associated with increased training-relevant features and reduced noise in prompts, resulting in improved learning performance.  The trained model exhibits the mechanism to first attend to demonstrations of training-relevant features and then decode the corresponding label embedding. Furthermore, we delineate the necessary conditions for successful out-of-domain generalization for in-context learning, specifically regarding the relationship between training and testing prompts.
Submission Number: 70
Loading