Keywords: transformers, in-context learning, few-shot learning
TL;DR: Together, our results suggest that the impressive ICL abilities of high-capacity transformer models may be more closely tied to the coverage of their pretraining data mixtures than to inductive biases that create fundamental generalization capabilities.
Abstract: Transformer models, notably large language models (LLMs), have the remarkable ability to perform in-context learning (ICL): to perform new tasks when prompted with unseen input-output examples, without any explicit model training. In this work, we study how effectively transformers can generalize beyond their pretraining data mixture, composed of one or multiple function classes, to identify and learn new functions in-context that lie outside the pretraining distribution. To investigate this question in a controlled setting, we focus on the transformer's ability to in-context learn functions from simulated data. While these models do well at generalizing to new functions within the pretrained function class, we demonstrate various failure modes when they are presented with tasks or functions that are out-of-distribution with respect to their pretraining data. Together, our results suggest that the impressive ICL abilities of high-capacity transformer models may be more closely tied to the coverage of their pretraining data mixtures than to inductive biases that create fundamental generalization capabilities.
Submission Number: 21