Keywords: learning theory, in-context learning, representation learning, Transformer, single-index model
TL;DR: Transformers can efficiently learn low-dimensional nonlinear functions in-context and outperform algorithms that work directly on the test prompt.
Abstract: Transformers can efficiently learn in-context from example demonstrations, and existing theoretical analyses studied this in-context learning (ICL) ability for linear function classes.
However, this simplified linear setting arguably does not demonstrate the statistical efficiency of ICL, since the trained transformer does not outperform linear regression run directly on the test prompt.
We study ICL of a nonlinear function class of the form $f_*(\boldsymbol{x}) = \sigma_*(\langle\boldsymbol{x},\ \boldsymbol{\beta}\rangle)$, called a single-index model, using a transformer with a nonlinear MLP layer.
When the index features $\boldsymbol{\beta}\in\mathbb{R}^d$ are drawn from a rank-$r$ subspace, we show that a nonlinear transformer optimized by gradient descent learns $f_*$ in-context with a prompt length that depends only on the dimension $r$ of the function class; in contrast, an algorithm that learns $f_*$ directly on the test prompt has a statistical complexity that scales with the ambient dimension $d$.
Our result highlights the adaptivity of ICL to low-dimensional structures of the function class.
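To make the setting concrete, the following is a minimal sketch of how one in-context prompt could be sampled from a single-index model whose index vector lies in a rank-$r$ subspace. Several choices here are illustrative assumptions not stated in the abstract: the subspace is taken to be the span of the first $r$ coordinate axes, the inputs are standard Gaussian, $\boldsymbol{\beta}$ is normalized to unit length, and the link function is $\sigma_*(z) = z^2 - 1$. The helper names `sample_task`, `link`, and `sample_prompt` are hypothetical.

```python
import math
import random


def sample_task(d, r, rng):
    """Draw a unit-norm index vector beta in R^d supported on the first r coordinates.

    Assumption: the rank-r subspace is the span of the first r axes.
    """
    coords = [rng.gauss(0.0, 1.0) for _ in range(r)]
    norm = math.sqrt(sum(c * c for c in coords))
    return [c / norm for c in coords] + [0.0] * (d - r)


def link(z):
    """Hypothetical link function sigma_*(z) = z^2 - 1 (an assumption)."""
    return z * z - 1.0


def sample_prompt(beta, n, rng):
    """Sample n demonstrations (x_i, y_i) with y_i = sigma_*(<x_i, beta>).

    Assumption: inputs x_i are i.i.d. standard Gaussian and labels are noiseless.
    """
    d = len(beta)
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    ys = [link(sum(xi * bi for xi, bi in zip(x, beta))) for x in xs]
    return xs, ys


rng = random.Random(0)
d, r, n = 32, 4, 16  # ambient dimension, subspace rank, prompt length
beta = sample_task(d, r, rng)
xs, ys = sample_prompt(beta, n, rng)
```

Each new task draws a fresh `beta` from the same $r$-dimensional subspace, so a model trained across many such prompts can in principle exploit the shared subspace, whereas a learner that only sees one prompt must search over all $d$ coordinates.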
Submission Number: 48