Towards Understanding Out-of-Distribution Generalization for In-Context Learning via Low-Dimensional Subspaces

Published: 24 Apr 2026, Last Modified: 24 Apr 2026, CauScale 2026, License: CC BY 4.0
Keywords: in-context learning, out-of-distribution generalization, low-dimensional subspaces
TL;DR: We theoretically characterize in-context learning's out-of-distribution generalization abilities on function classes using functions sampled from low-dimensional subspaces.
Abstract: The transformer's remarkable ability to perform in-context learning (ICL) has sparked a wide range of studies designed to understand its strengths and limitations. However, a theoretical understanding of when ICL can and cannot generalize beyond its pre-training data is still lacking. This paper puts forth a minimal mathematical model that provably identifies when ICL can generalize out-of-distribution (OOD). By studying linear regression tasks parameterized with low-rank covariance matrices, we model distribution shifts as varying angles between subspaces and derive conditions under which a single-layer linear attention model interpolates across all angles. We show that if pre-training task vectors are drawn from a union of subspaces, transformers can generalize to all angle shifts, enabling ICL even in regions with zero probability mass under the training distribution. In contrast, if the pre-training tasks are drawn from a single Gaussian, the test risk exhibits a non-negligible dependence on the angle, implying that ICL cannot generalize OOD. We empirically show that these results also hold for models such as GPT-2, and present experiments on how they extend to nonlinear function classes.
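The setup described in the abstract can be sketched concretely. The snippet below is an illustrative NumPy sketch, not the authors' code: the dimensions (`d`, `k`, `n`), the choice of rotating the first two coordinates to model the subspace angle, and the ridge-regression estimator (used here as a simple stand-in for a learned in-context predictor) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, n = 8, 2, 32  # ambient dimension, subspace rank, context length (illustrative values)

def sample_task(basis, rng):
    """Sample a task vector w from the column span of `basis` (a low-dimensional subspace)."""
    return basis @ rng.standard_normal(basis.shape[1])

def make_prompt(w, n, rng):
    """Generate an ICL prompt of (x_i, y_i) pairs for the noiseless linear task y = w^T x."""
    X = rng.standard_normal((n, w.shape[0]))
    y = X @ w
    return X, y

def rotate_subspace(basis, theta):
    """Rotate the basis in the plane of the first two coordinates by `theta`,
    modeling the angle shift between pre-training and test subspaces."""
    R = np.eye(basis.shape[0])
    c, s = np.cos(theta), np.sin(theta)
    R[:2, :2] = [[c, -s], [s, c]]
    return R @ basis

# Pre-training subspace: span of the first k standard basis vectors.
B_train = np.eye(d)[:, :k]
# OOD test subspace: the same subspace rotated by 45 degrees.
B_test = rotate_subspace(B_train, np.pi / 4)

# A test task drawn from the shifted subspace, and its in-context prompt.
w = sample_task(B_test, rng)
X, y = make_prompt(w, n, rng)

# Ridge regression on the context alone recovers w (proxy for an ideal
# in-context predictor in this linear setting; n > d so the system is determined).
w_hat = np.linalg.solve(X.T @ X + 1e-6 * np.eye(d), X.T @ y)
print(np.allclose(w_hat, w, atol=1e-3))  # → True
```

The point of the sketch is only to make the objects in the abstract tangible: task vectors live in a rank-`k` subspace, the OOD shift is a rotation angle applied to that subspace, and a predictor is evaluated from the in-context examples of a task drawn from the shifted subspace.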
Submission Number: 29