Keywords: In-context learning; Linear regression; IV regression; Transformer
TL;DR: We investigate how multi-layer softmax attention models learn to perform in-context linear regression when the covariate distribution varies across sequences, revealing an interesting GD-like algorithm learned by the model.
Abstract: We study how multi-layer softmax-attention transformers perform in-context linear regression, focusing on the challenging setting where the covariate distribution varies across sequences. We show that multi-layer transformers substantially outperform single-layer models, demonstrating that depth is critical for sufficient expressivity. Through a novel probing methodology using Gaussian test data, we reveal that the transformer approximates linear regression by implementing a variant of the gradient descent algorithm. The parameters of this algorithm depend on model depth and data distribution, but are insensitive to the number of attention heads and the sequence length, corresponding to a consistent diagonal structure in the learned weight matrices. Building on this insight, we show that by incorporating a chain-of-thought-style intermediate step, the transformer can solve in-context instrumental variable (IV) regression, achieving performance comparable to a two-stage least squares estimator. Our findings provide new evidence that depth enables transformers to learn sophisticated in-context algorithms, bridging the gap between empirical performance and interpretable algorithmic behavior.
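For readers unfamiliar with the two baselines the abstract references, the sketch below illustrates them on synthetic data: plain gradient descent on the least-squares loss (a stand-in for the GD-like in-context algorithm the paper probes; the step size, iteration count, and covariance construction are illustrative assumptions, not values from the paper) and the classical two-stage least squares (2SLS) estimator for IV regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- In-context linear regression with a sequence-specific covariate
# --- distribution (here, a random well-conditioned covariance).
d, n = 5, 200
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d + 0.5 * np.eye(d)    # per-sequence covariate covariance
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=n)

# Plain gradient descent on the least-squares loss. The paper's finding is
# that the trained transformer implements a *variant* of this iteration;
# the hyperparameters here are purely illustrative.
w = np.zeros(d)
eta = 0.5 / np.linalg.eigvalsh(X.T @ X / n).max()
for _ in range(500):
    w -= eta * (X.T @ (X @ w - y)) / n   # gradient step on (1/2n)||Xw - y||^2

# --- Two-stage least squares (2SLS), the estimator the IV experiments are
# --- compared against. z instruments an endogenous scalar covariate x.
m = 5000
z = rng.normal(size=m)                   # instrument
u = rng.normal(size=m)                   # unobserved confounder
x = 0.8 * z + u + 0.3 * rng.normal(size=m)
y2 = 1.5 * x + u + 0.3 * rng.normal(size=m)   # true causal effect is 1.5

x_hat = z * (z @ x) / (z @ z)                 # stage 1: project x onto z
beta_2sls = (x_hat @ y2) / (x_hat @ x_hat)    # stage 2: regress y on x_hat
```

Ordinary least squares of `y2` on `x` would be biased upward by the confounder `u`; the 2SLS estimate recovers the true coefficient because the instrument `z` is independent of `u`.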
Primary Area: interpretability and explainable AI
Submission Number: 15712