Keywords: In-context learning, attention, semi-supervision, Gaussian mixture model
Abstract: Recent research shows that in-context learning (ICL) can be effective even in settings where demonstrations have missing or incorrect labels. This motivates a deeper understanding of how sequence models leverage unlabeled data. We consider a canonical setting where the in-context demonstrations are drawn according to a binary Gaussian mixture model (GMM) and a certain fraction of the demonstrations have missing labels. We provide a comprehensive theoretical study to show that: (1) A loss-landscape analysis of one-layer linear attention reveals that it learns the optimal fully-supervised learner but completely fails to leverage the unlabeled data. (2) Multilayer and looped transformers can effectively leverage unlabeled data by implicitly constructing estimators of the form $\sum_{i\ge 0} a_i (X^\top X)^i X^\top y$, with $X$ and $y$ denoting features and visible labels. We shed light on the class of polynomials that can be expressed as a function of depth/looping and draw connections to iterative pseudo-labeling.
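The estimator family $\sum_{i\ge 0} a_i (X^\top X)^i X^\top y$ can be computed with repeated matrix-vector products. Below is a minimal sketch, not taken from the paper: the coefficients, sample sizes, mixture mean, and the convention of zeroing out missing labels in $y$ are all illustrative assumptions.

```python
# Sketch: a polynomial-in-(X^T X) estimator on a binary GMM with partially
# visible labels. All concrete values below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, d, frac_labeled = 200, 10, 0.3          # assumed sample size, dimension, labeled fraction
mu = rng.normal(size=d) / np.sqrt(d)       # assumed mean of the binary GMM

# Binary GMM: x = y * mu + noise, labels in {-1, +1}
y_full = rng.choice([-1.0, 1.0], size=n)
X = y_full[:, None] * mu + rng.normal(size=(n, d))

# Visible labels: missing labels are set to 0 (one common convention, assumed here)
visible = rng.random(n) < frac_labeled
y = np.where(visible, y_full, 0.0)

def poly_estimator(X, y, coeffs):
    """Return w = sum_i coeffs[i] * (X^T X)^i X^T y."""
    G = X.T @ X
    w = np.zeros(X.shape[1])
    term = X.T @ y                         # (X^T X)^0 X^T y
    for a in coeffs:
        w += a * term
        term = G @ term                    # advance to the next power of X^T X
    return w

# Example: degree-2 polynomial with assumed coefficients
w = poly_estimator(X, y, coeffs=[1.0, -0.01, 1e-4])
acc = np.mean(np.sign(X @ w) == y_full)
print(f"accuracy on all (labeled + unlabeled) points: {acc:.2f}")
```

Each additional power of $X^\top X$ corresponds, per the abstract, to added depth or an extra loop iteration, which is what lets deeper models express richer polynomials than one-layer linear attention.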
Submission Number: 95