Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression

Published: 21 Sept 2023, Last Modified: 02 Nov 2023, NeurIPS 2023 poster
Keywords: in-context learning, Bayesian inference, transformers, task diversity, emergence
TL;DR: We empirically demonstrate a task diversity threshold for the emergence of in-context learning in pretrained transformers beyond which the model can learn fundamentally new tasks in-context.
Abstract: Pretrained transformers exhibit the remarkable ability of in-context learning (ICL): they can learn tasks from just a few examples provided in the prompt without updating any weights. This raises a foundational question: can ICL solve fundamentally _new_ tasks that are very different from those seen during pretraining? To probe this question, we examine ICL’s performance on linear regression while varying the diversity of tasks in the pretraining dataset. We empirically demonstrate a _task diversity threshold_ for the emergence of ICL. Below this threshold, the pretrained transformer cannot solve unseen regression tasks, instead behaving like a Bayesian estimator with the _non-diverse pretraining task distribution_ as the prior. Beyond this threshold, the transformer significantly outperforms this estimator; its behavior aligns with that of ridge regression, corresponding to a Gaussian prior over _all tasks_, including those not seen during pretraining. Thus, when pretrained on data with task diversity greater than the threshold, transformers _can_ optimally solve fundamentally new tasks in-context. Importantly, this capability hinges on the transformer deviating from the Bayes optimal estimator with the pretraining distribution as the prior. This study also explores the effects of regularization, model capacity, and task structure, and underscores, in a concrete example, the critical role of task diversity, alongside data and model scale, in the emergence of ICL.
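The two reference estimators contrasted in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code. It assumes noisy linear regression, y = Xw + noise with Gaussian noise of variance `noise_var`, a uniform prior over a finite set of pretraining task vectors for the first estimator, and a standard Gaussian prior over all task vectors for the second. The function names, dimensions, and noise level are hypothetical choices made for illustration.

```python
import numpy as np


def discrete_prior_estimator(X, y, tasks, noise_var):
    """Posterior-mean weights under a uniform prior over a finite task set.

    Assumption: y = X @ w + Gaussian noise with variance noise_var,
    where w is one of the rows of `tasks` (shape (M, d)).
    """
    # Residual of every candidate task on the context examples: shape (M, n).
    residuals = y[None, :] - tasks @ X.T
    # Log-likelihood of each candidate, up to an additive constant.
    log_w = -0.5 * np.sum(residuals ** 2, axis=1) / noise_var
    log_w -= log_w.max()  # subtract the max for numerical stability
    weights = np.exp(log_w)
    weights /= weights.sum()
    # Posterior mean is the likelihood-weighted average of the candidates.
    return weights @ tasks


def ridge_estimator(X, y, noise_var):
    """Posterior-mean weights under a standard Gaussian prior N(0, I)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + noise_var * np.eye(d), X.T @ y)


# Hypothetical setup: a few pretraining tasks, then one unseen task.
rng = np.random.default_rng(0)
d, n, num_tasks, noise_var = 8, 16, 4, 0.25
pretraining_tasks = rng.standard_normal((num_tasks, d))
w_new = rng.standard_normal(d)  # a task not in the pretraining set

X = rng.standard_normal((n, d))
y = X @ w_new + np.sqrt(noise_var) * rng.standard_normal(n)

w_discrete = discrete_prior_estimator(X, y, pretraining_tasks, noise_var)
w_ridge = ridge_estimator(X, y, noise_var)

# Squared error of each estimate against the unseen task vector.
print("discrete-prior error:", np.sum((w_discrete - w_new) ** 2))
print("ridge error:", np.sum((w_ridge - w_new) ** 2))
```

With only a few pretraining tasks, the discrete-prior estimate is restricted to weighted averages of those tasks, so it cannot in general recover an unseen task vector. The ridge estimate has no such restriction, which is the contrast the abstract draws between the two regimes.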
Submission Number: 14855