Surprising Deviations from Bayesian View in In-Context Learning

Madhur Panwar; Kabir Ahuja; Navin Goyal

Published: 27 Oct 2023, Last Modified: 24 Apr 2024. Venue: ICBINB 2023

Keywords: In-context Learning, Transformers, Inductive Biases, Meta Learning, Language Modelling, Bayesian Inference

Abstract: In-context learning (ICL) is one of the surprising and useful features of large language models and the subject of intense research. Recently, stylized meta-learning-like ICL setups have been devised that train transformers on sequences of input-output pairs $(x, f(x))$ using the language modeling loss. The function $f$ comes from a function class, and generalization is checked by evaluating on sequences for unseen functions from the same class. One of the main discoveries in this line of research has been that for several function classes, such as linear regression, transformers successfully generalize to new functions in the class. However, the inductive biases of these models that result in this behavior are not clearly understood. A model with unlimited training data and compute is a Bayesian predictor: it learns the pretraining distribution. In this paper we empirically examine how far this Bayesian perspective can help us understand ICL. To this end, we generalize the previous meta-ICL setup to a hierarchical meta-ICL setup that involves unions of multiple task families. We instantiate this setup on multiple function families and find that transformers can do ICL in this setting as well. We make some surprising observations: Transformers can learn to generalize to new function classes that were not seen during pretraining. This requires pretraining on only a very small number of function classes and involves deviating from the Bayesian predictor on the pretraining distribution. Further, we discover the phenomenon of 'forgetting', where over the course of pretraining under the hierarchical meta-ICL setup, the transformer first generalizes to the full distribution of tasks and later forgets it while fitting the pretraining distribution.
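To make the hierarchical meta-ICL setup concrete, here is a minimal sketch of how one prompt might be sampled: first a function family is drawn from a mixture, then a function $f$ from that family, then a sequence of $(x, f(x))$ pairs. Everything here is an illustrative assumption and not the authors' code: the two families (dense and sparse linear functions), the Gaussian inputs and weights, the mixture weights, and all names such as `sample_prompt` and `FAMILIES` are hypothetical.

```python
import random

def sample_dense_linear(rng, dim):
    """Assumed family 1: f(x) = w . x with every weight drawn from N(0, 1)."""
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))

def sample_sparse_linear(rng, dim, k=2):
    """Assumed family 2 (hypothetical): linear with only k nonzero weights."""
    w = [0.0] * dim
    for i in rng.sample(range(dim), k):
        w[i] = rng.gauss(0.0, 1.0)
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))

# The "union of task families": a mixture over family samplers.
FAMILIES = (sample_dense_linear, sample_sparse_linear)

def sample_prompt(rng, dim=5, n_points=8, family_weights=(0.5, 0.5)):
    """Draw a family, then a function from it, then (x, f(x)) pairs."""
    family_id = rng.choices(range(len(FAMILIES)), weights=family_weights)[0]
    f = FAMILIES[family_id](rng, dim)
    xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]
    pairs = [(x, f(x)) for x in xs]
    return family_id, pairs

rng = random.Random(0)
family_id, pairs = sample_prompt(rng)
```

In a setup like this, a transformer would be trained to predict each $f(x)$ from the preceding pairs in the sequence, and generalization would be probed on functions, or whole families, held out from `FAMILIES`.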

Submission Number: 39
