Context-Scaling versus Task-Scaling in In-Context Learning

Amirhesam Abedsoltan; Adityanarayanan Radhakrishnan; Jingfeng Wu; Mikhail Belkin

27 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: in-context learning, kernel smoothers, Hilbert estimate

TL;DR: We study two scaling regimes (context-scaling and task-scaling) arising in in-context learning and identify a key mechanism through which transformers are able to provably context-scale, in contrast to standard MLPs, which are only able to task-scale.

Abstract: Transformers exhibit In-Context Learning (ICL), a phenomenon in which these models solve new tasks by using examples in the prompt without additional training. In this work, we analyze two key components of ICL: (1) context-scaling, where model performance improves as the number of in-context examples increases, and (2) task-scaling, where model performance improves as the number of pre-training tasks increases. While transformers are capable of both context-scaling and task-scaling, we empirically show that standard Multi-Layer Perceptrons (MLPs) with vectorized input are only capable of task-scaling. To understand how transformers are capable of context-scaling, we first propose a significantly simplified transformer that performs ICL comparably to the original GPT-2 model in statistical learning tasks (e.g., linear regression, teacher-student settings). By analyzing a single layer of our proposed model, we identify classes of feature maps that enable context-scaling. Theoretically, these feature maps can implement the Hilbert estimate, a model that is provably consistent for context-scaling. We then show that using the output of the Hilbert estimate along with vectorized input empirically enables both context-scaling and task-scaling with MLPs. Overall, our findings provide insights into the fundamental mechanisms of how transformers are able to learn in context.
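For readers unfamiliar with the Hilbert estimate named in the abstract, the following is a minimal sketch of it as a kernel smoother, written in the form usually given in the kernel-regression literature: each in-context label is weighted by the inverse d-th power of the Euclidean distance between its input and the query, where d is the input dimension. The function name `hilbert_estimate`, the `eps` tolerance, and the plain-list data layout are assumptions made for illustration; the abstract does not specify the paper's exact parameterization or how the feature maps realize it.

```python
import math


def hilbert_estimate(context_x, context_y, query_x, eps=1e-12):
    """Predict a label for query_x from in-context pairs (x_i, y_i).

    Weights are proportional to ||query_x - x_i||^(-d), with d the input
    dimension. This is an illustrative sketch, not the paper's code.
    """
    d = len(query_x)
    weights = []
    for x_i, y_i in zip(context_x, context_y):
        dist = math.dist(x_i, query_x)  # Euclidean distance (Python 3.8+)
        if dist < eps:
            # The kernel is singular at zero distance, so the estimate
            # interpolates: it returns the label of the matching example.
            return y_i
        weights.append(dist ** (-d))
    total = sum(weights)
    # Weighted average of the in-context labels.
    return sum(w * y for w, y in zip(weights, context_y)) / total
```

The estimate has no trainable parameters and no bandwidth to tune, so adding more in-context examples is the only thing that changes its prediction; this is the sense in which it relates to context-scaling rather than task-scaling.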

Supplementary Material: zip

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Submission Number: 11868
