Keywords: Transformer, In-context Learning, Bayesian Inference
TL;DR: We study how the number of pretraining tasks affects a transformer's in-context learning behavior in ridge regression.
Abstract: Pretrained transformers can do in-context learning (ICL), i.e., learn new tasks in the forward pass from a few examples provided in context.
But can the model do ICL for completely new tasks or is this ability restricted to tasks similar to those seen during pretraining?
How does the diversity of tasks seen during pretraining affect the model's ability to do ICL? In the setting of ICL for ridge regression, we show that a model pretrained on a small number of tasks sampled from a latent distribution behaves like the Bayesian estimator whose prior is the discrete distribution over those sampled tasks, whereas a model pretrained on a sufficiently large number of tasks behaves like the Bayesian estimator whose prior is the underlying latent distribution over tasks. Our results suggest that, as the diversity of the pretraining dataset increases, the model transitions from doing ICL on tasks similar to those seen during pretraining to learning the underlying task structure and doing ICL on genuinely new tasks.
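For concreteness, here is a minimal NumPy sketch of the two Bayesian estimators the abstract contrasts. All specifics (dimension d, context length n, number of pretraining tasks M, noise level sigma, prior scale tau, the uniform prior over the sampled tasks) are illustrative assumptions, not the paper's exact experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions (not the paper's exact setup):
d, n, M = 8, 16, 32      # task dimension, context length, number of pretraining tasks
tau, sigma = 1.0, 0.5    # scale of the latent task prior, label-noise standard deviation

# Latent task distribution: w ~ N(0, tau^2 I); pretraining tasks are M samples from it.
pretrain_tasks = rng.normal(0.0, tau, size=(M, d))

# A fresh in-context regression problem drawn from the latent distribution.
w_star = rng.normal(0.0, tau, size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + rng.normal(0.0, sigma, size=n)

# (1) Bayesian estimator under the continuous latent prior N(0, tau^2 I):
#     exactly ridge regression with regularization strength sigma^2 / tau^2.
w_ridge = np.linalg.solve(X.T @ X + (sigma**2 / tau**2) * np.eye(d), X.T @ y)

# (2) Bayesian estimator under the discrete prior (uniform over the M sampled tasks):
#     posterior weight of task w_i is proportional to exp(-||y - X w_i||^2 / (2 sigma^2)).
log_post = -0.5 * np.sum((y - pretrain_tasks @ X.T) ** 2, axis=1) / sigma**2
post = np.exp(log_post - log_post.max())
post /= post.sum()
w_discrete = post @ pretrain_tasks  # posterior mean over the finite task set

# The claim: with few pretraining tasks the transformer tracks w_discrete;
# with many tasks it tracks w_ridge (the estimator under the latent Gaussian prior).
print("latent-prior (ridge) error:", np.linalg.norm(w_ridge - w_star))
print("discrete-prior error:      ", np.linalg.norm(w_discrete - w_star))
```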