Keywords: in-context learning, transformers, deep learning theory, learning theory
TL;DR: Transformers can learn in context like statisticians --- A single transformer can select different algorithms for the data at hand.
Abstract: This work advances the understanding of the remarkable \emph{in-context learning} (ICL) abilities of transformers---the ability to perform new tasks when prompted with training and test examples, without any parameter update to the model. We begin by showing that transformers can implement a broad class of standard machine learning algorithms in context, such as least squares, ridge regression, Lasso, convex risk minimization for generalized linear models, and gradient descent on two-layer neural networks, with near-optimal predictive power on various in-context data distributions. Our transformer constructions admit mild bounds on the number of layers and heads, and can be learned with polynomially many pretraining sequences.
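As a rough illustration of the setting (a minimal sketch, not part of the submission): an ICL prompt consists of labeled in-context examples followed by a query, and the constructed transformers are shown to match standard estimators such as ridge regression on that prompt. The NumPy snippet below computes only the ridge baseline itself; all variable names and constants are illustrative assumptions.

```python
import numpy as np

# Hypothetical in-context prompt: n labeled examples (x_i, y_i) plus a query x_query.
rng = np.random.default_rng(0)
d, n = 8, 32
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)
x_query = rng.normal(size=d)

# Ridge regression on the in-context examples -- one of the "base" ICL algorithms
# that the paper's transformer constructions are shown to emulate.
lam = 0.5  # illustrative penalty
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
pred = x_query @ w_ridge  # the in-context prediction the transformer would output
```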
Building on these ``base'' ICL algorithms, intriguingly, we show that transformers can implement more complex ICL procedures involving \emph{in-context algorithm selection}, akin to what a statistician can do in real life---A \emph{single} transformer can adaptively select different base ICL algorithms---or even perform qualitatively different tasks---on different input sequences, without any explicit prompting of the right algorithm or task. In theory, we construct two general mechanisms for algorithm selection with concrete examples: (1) Pre-ICL testing, where the transformer determines the right task for the given sequence by examining certain summary statistics of the input sequence; (2) Post-ICL validation, where the transformer selects---among multiple base ICL algorithms---a near-optimal one for the given sequence using a train-validation split. Experimentally, we demonstrate the strong in-context algorithm selection capabilities of standard transformer architectures.
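To make the post-ICL validation mechanism concrete (a hedged sketch only; the paper constructs a transformer that carries out this selection internally, whereas the code below merely mirrors the idea in NumPy with illustrative choices): split the in-context examples, run several base algorithms, and keep whichever validates best.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 40
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.5 * rng.normal(size=n)

# Train-validation split of the in-context examples.
n_train = n // 2
X_tr, y_tr = X[:n_train], y[:n_train]
X_val, y_val = X[n_train:], y[n_train:]

def ridge_fit(X, y, lam):
    """Closed-form ridge regression; lam = 0 recovers least squares."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Candidate base ICL algorithms (here: ridge with different penalties).
lambdas = [0.0, 0.1, 1.0, 10.0]
fits = [ridge_fit(X_tr, y_tr, lam) for lam in lambdas]
val_errs = [np.mean((X_val @ w - y_val) ** 2) for w in fits]

best = int(np.argmin(val_errs))   # near-optimal base algorithm for this sequence
w_selected = fits[best]
```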
Submission Number: 14