Keywords: In-Context Learning, Learning Phase Transition, Double Descent, Exactly Solvable Model, Random Matrix Theory
Abstract: Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers' success, yet questions about the sample complexity, pretraining task diversity, and context length needed for successful ICL remain unresolved. In this work, we provide precise answers to these questions using a solvable model of ICL for a linear regression task with linear attention. We derive asymptotics for the learning curve in a regime where the token dimension, context length, and pretraining task diversity scale proportionally, while the number of pretraining examples scales quadratically. Our analysis reveals a double-descent learning curve and a phase transition between low and high task-diversity regimes, both of which we validate empirically with experiments on realistic Transformer architectures.
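To make the in-context linear regression setup concrete, the following is a minimal NumPy sketch, not the authors' code or exact construction: a regression task w, a context of (x, y) pairs, and a linear-attention-style predictor of the form y_hat = x_query^T Gamma X^T y. The dimensions, the Gaussian data model, and the choice Gamma proportional to the identity are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 32     # token (input) dimension
ell = 64   # context length (number of in-context examples)

# Draw a regression task w, an in-context dataset (X, y), and a query point.
w = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(ell, d))
y = X @ w
x_query = rng.normal(size=d)

# A linear-attention layer acting on tokens z_i = [x_i; y_i] can realize a
# predictor of the form y_hat = x_query^T Gamma X^T y for a learned matrix
# Gamma; taking Gamma proportional to the identity corresponds to one step
# of gradient descent (from zero) on the in-context least-squares loss.
Gamma = np.eye(d) / ell
y_hat = x_query @ Gamma @ (X.T @ y)

print(f"prediction: {y_hat:.3f}   target: {x_query @ w:.3f}")
```

In the asymptotic regime described in the abstract, one would study the test error of such a learned predictor as d, ell, and the number of pretraining tasks grow proportionally; the sketch above only illustrates the forward pass for a single context.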
Is NeurIPS Submission: No
Submission Number: 70