Keywords: Transformer, in-context learning
TL;DR: We examine the underlying mechanisms by which a Transformer performs in-context learning, from the perspective of functional gradient descent, providing a new interpretation of softmax attention.
Abstract: We examine Transformer-based in-context learning for contextual data of the form $(x_i,y_i)$ for $i=1,\ldots,N$, and query $x_{N+1}$, where $x_i\in\mathbb{R}^d$ and $y_i\sim p(Y|f(x_i))$, with $f(x)$ a latent function. This is analyzed from the perspective of *functional* gradient descent on the latent $f(x)$. We first perform this analysis from the perspective of a reproducing kernel Hilbert space (RKHS), from which an alternative kernel-averaging perspective emerges. This leads to a generalization that interprets softmax attention as a Nadaraya-Watson kernel-weighted average. We show that a single attention layer may be designed to exactly implement a functional-gradient step in this setting (for RKHS latent functions), extending prior work for the special case of real-valued $Y$ and Gaussian $p(Y|f(x))$. This is further generalized to softmax attention and to non-RKHS underlying $f(x)$. Although our results hold in a general setting, we focus on categorical $Y$ with $p(Y|f(x))$ modeled as a generalized linear model (corresponding specifically to softmax probability). Multi-layered extensions are developed for this case, and through extensive experimentation we demonstrate that, for categorical $Y$, a single-layer model is often highly effective for such in-context learning. We also demonstrate these ideas on real-world data, considering in-context classification of ImageNet data, showing the broad applicability of our theory beyond the commonly studied setting of synthetic regression data.
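To make the Nadaraya-Watson connection referenced in the abstract concrete, here is a minimal sketch (our notation, not taken from the submission): with an exponential kernel $K(x_{N+1},x_i)=\exp(x_{N+1}^\top W x_i)$, the Nadaraya-Watson kernel-weighted prediction for the query,
$$\hat{y}_{N+1} \;=\; \sum_{i=1}^{N} \frac{K(x_{N+1},x_i)}{\sum_{j=1}^{N} K(x_{N+1},x_j)}\, y_i \;=\; \sum_{i=1}^{N} \mathrm{softmax}_i\!\big(x_{N+1}^\top W x_1,\ldots,x_{N+1}^\top W x_N\big)\, y_i,$$
has exactly the form of a single softmax-attention layer with query $x_{N+1}$, keys $x_1,\ldots,x_N$, and values $y_1,\ldots,y_N$ (with $W$ playing the role of the combined query-key projection); the abstract's functional-gradient interpretation builds on this correspondence.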
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8112