Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time

ICLR 2026 Conference Submission9983 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026, ICLR 2026, CC BY 4.0
Keywords: Deep Learning, scaling laws, in-context learning, transformers, attention
TL;DR: A theory of scaling laws for ICL regression that predicts optimal width/depth shapes.
Abstract: We study in-context learning (ICL) of linear regression in a deep linear self-attention model, characterizing how performance depends on various computational and statistical resources (width, depth, number of training steps, batch size, and data per context). In a joint limit where the data dimension, context length, and residual stream width scale proportionally, we analyze the limiting asymptotics for three ICL settings: (1) isotropic covariates and tasks (ISO), (2) fixed and structured covariance (FS), and (3) covariances that are randomly rotated and structured (RRS). In the ISO and FS settings, we find that depth aids ICL performance only if context length is limited. In contrast, in the RRS setting, where covariances change across contexts, increasing depth leads to significant improvements in ICL even at infinite context length. This provides a new solvable toy model of neural scaling laws that depends on both the width and depth of a transformer and predicts optimal transformer shapes as a function of compute.
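As a rough illustration of the three data settings named in the abstract, the sketch below samples a single in-context linear-regression context under ISO, FS, and RRS assumptions. It is not the authors' code: the power-law eigenvalue spectrum, the function name `sample_context`, and the noiseless targets are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's implementation): sampling one ICL
# linear-regression context under the three covariance settings (ISO, FS, RRS).
import numpy as np

rng = np.random.default_rng(0)


def sample_context(d, n, setting="ISO"):
    """Sample one context of n (x, y) pairs in dimension d.

    setting:
      "ISO" -- isotropic covariates and tasks.
      "FS"  -- a fixed, structured covariance shared across all contexts.
      "RRS" -- the structured covariance is randomly rotated per context.
    """
    # A power-law spectrum stands in for "structured" covariance
    # (an illustrative choice, not taken from the paper).
    eigs = np.arange(1, d + 1) ** -1.0

    if setting == "ISO":
        cov = np.eye(d)
    elif setting == "FS":
        cov = np.diag(eigs)                                # same covariance every context
    elif setting == "RRS":
        Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random rotation per context
        cov = Q @ np.diag(eigs) @ Q.T
    else:
        raise ValueError(setting)

    L = np.linalg.cholesky(cov + 1e-12 * np.eye(d))
    X = rng.standard_normal((n, d)) @ L.T          # covariates with covariance `cov`
    w = rng.standard_normal(d) / np.sqrt(d)        # task vector for this context
    y = X @ w                                      # noiseless linear targets
    return X, y, w


if __name__ == "__main__":
    for setting in ("ISO", "FS", "RRS"):
        X, y, w = sample_context(d=16, n=32, setting=setting)
        print(setting, X.shape, y.shape)
```

In this sketch, only the RRS branch redraws the rotation each call, which mirrors the abstract's distinction that covariances change across contexts in RRS but not in ISO or FS.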
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 9983