Keywords: in-context learning, transformers, gated linear units, kernel regression
Abstract: Transformer-based models demonstrate a remarkable ability for *in-context learning* (ICL), where they can adapt to unseen tasks from a few prompt examples without parameter updates.
Notably, recent research has provided insight into how the Transformer architecture can perform ICL, showing that the optimal *linear self-attention* (LSA) mechanism can implement one step of gradient descent for linear least-squares objectives when trained on random linear regression tasks.
Building upon this understanding, we investigate ICL for *nonlinear* function classes.
We first prove that LSA is inherently incapable of outperforming linear predictors on nonlinear tasks, thereby highlighting a hard expressivity barrier for attention-only models.
To overcome this limitation, we analyze a Transformer block consisting of LSA and feed-forward layers inspired by *gated linear units* (GLUs), a standard component in modern Transformer architectures.
We show that this block achieves nonlinear ICL by implementing one step of gradient descent on a polynomial kernel regression loss.
Furthermore, our analysis reveals that the expressivity of a single such block is inherently limited by its dimensions.
We then show that a deep Transformer can overcome this bottleneck by distributing the computation of richer kernel functions across multiple blocks, effectively performing block-coordinate descent in a high-dimensional feature space that a single block cannot represent.
Our findings highlight that the feed-forward layers provide a crucial and scalable mechanism by which Transformers can express nonlinear representations for ICL.
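As a rough illustration of the mechanism described above (a sketch of the underlying idea, not the paper's construction), one gradient-descent step from zero on an in-context least-squares loss, and its polynomial-kernel analogue, can be written down directly; the function names, the learning rate `eta`, and the kernel degree are illustrative choices:

```python
import numpy as np

def one_step_gd_prediction(X, y, x_query, eta=0.1):
    """One GD step from w = 0 on the least-squares loss
    L(w) = (1/2n) * sum_i (w @ x_i - y_i)^2, then predict on x_query.
    This is the linear in-context predictor attributed to optimal LSA."""
    n = len(y)
    grad = X.T @ (X @ np.zeros(X.shape[1]) - y) / n  # gradient at w = 0
    w = -eta * grad                                  # weights after one step
    return w @ x_query

def one_step_kernel_gd_prediction(X, y, x_query, degree=2, eta=0.1):
    """Same idea in a polynomial-kernel feature space: one GD step on
    kernel-regression coefficients alpha, starting from alpha = 0, for
    L(alpha) = (1/2n) * ||K @ alpha - y||^2."""
    n = len(y)
    K = (1.0 + X @ X.T) ** degree        # polynomial kernel Gram matrix
    k_q = (1.0 + X @ x_query) ** degree  # kernel between prompt points and query
    grad = K @ (K @ np.zeros(n) - y) / n  # gradient at alpha = 0
    alpha = -eta * grad                   # coefficients after one step
    return alpha @ k_q
```

The linear predictor is the expressivity ceiling for LSA alone; the kernelized version is the kind of nonlinear in-context predictor that, per the abstract, requires the GLU-style feed-forward layers to realize.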
Submission Number: 28