Keywords: Transformers, In-Context Learning, Data Mixtures, Model Selection
TL;DR: We study, in a controlled setting, how sample-efficiently transformers learn in-context the different components of the task mixtures in their pretraining data.
Abstract: In-context learning (ICL) refers to the ability of Large Language Models (LLMs) to perform new tasks by conditioning on input-output samples without any parameter updates. Previous work has established that, in a controlled setting, transformers can optimally perform ICL for tasks from a single task family, here a single function class, when they are pretrained on example tasks from that family. Using this setting, we probe the relationship between the pretraining data mixtures and downstream ICL performance. In particular, we empirically explore the ability of pretrained transformers to \textit{select a family of tasks} (i.e., among distinct function classes) and \textit{perform learning within that task family} (i.e., learn a function within a function class), all in-context. We show that, for pretraining task mixtures balanced across task families, the cost of unsupervised downstream ICL task-family selection is near zero. For task families rarely seen in pretraining, downstream in-context learning curves exhibit complex, task-dependent, non-monotonic behavior. We also characterize the benefit of conditional pretraining in this simplified model, showing how task-family instructions can reduce the overhead of in-context task-family selection.
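The controlled setting described in the abstract can be illustrated with a minimal sketch of how one in-context prompt might be drawn from a mixture of task families. Everything specific here is an assumption for illustration and is not taken from the submission: the two families (`dense_linear`, `sparse_linear`), Gaussian inputs and weights, and the names `FAMILIES` and `sample_prompt` are hypothetical. The `conditional` flag mimics the idea of a task-family instruction by exposing the family name alongside the examples.

```python
import random

# Hypothetical task families: each sampler returns one function from its class.
def sample_dense_linear(dim, rng):
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))

def sample_sparse_linear(dim, rng, nnz=2):
    # Only `nnz` randomly chosen coordinates get a non-zero weight.
    w = [0.0] * dim
    for i in rng.sample(range(dim), nnz):
        w[i] = rng.gauss(0.0, 1.0)
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))

FAMILIES = {
    "dense_linear": sample_dense_linear,
    "sparse_linear": sample_sparse_linear,
}

def sample_prompt(mixture, n_examples, dim, rng, conditional=False):
    """Draw a task family from `mixture`, then a function, then (x, f(x)) pairs."""
    names = list(mixture)
    weights = [mixture[name] for name in names]
    family = rng.choices(names, weights=weights, k=1)[0]
    f = FAMILIES[family](dim, rng)
    xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_examples)]
    examples = [(x, f(x)) for x in xs]
    # With conditional=True the family label is given, so no selection is needed.
    instruction = family if conditional else None
    return {"family": family, "instruction": instruction, "examples": examples}

if __name__ == "__main__":
    rng = random.Random(0)
    balanced = {"dense_linear": 0.5, "sparse_linear": 0.5}
    prompt = sample_prompt(balanced, n_examples=8, dim=4, rng=rng)
    print(prompt["family"], len(prompt["examples"]))
```

In this sketch, a balanced mixture corresponds to equal weights, and a rarely seen family corresponds to a small weight for one entry of `mixture`. A model trained on such prompts without the instruction would have to infer the family from the examples alone.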
Submission Number: 15