Functional Gradients and Generalizations for Transformer In-Context Learning

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Transformer, in-context learning
TL;DR: We examine the underlying mechanisms by which a Transformer performs in-context learning, from the perspective of functional gradient descent, providing a new interpretation of softmax attention.
Abstract: We examine Transformer-based in-context learning for contextual data of the form $(x_i,y_i)$ for $i=1,\ldots,N$, and query $x_{N+1}$, where $x_i\in\Bbb{R}^d$ and $y_i\sim p(Y|f(x_i))$, with $f(x)$ a latent function. This is analyzed from the perspective of *functional* gradient descent on the latent $f(x)$. We first carry out this analysis in a reproducing kernel Hilbert space (RKHS), from which an alternative kernel-averaging view emerges. This leads to a generalization that interprets softmax attention as a Nadaraya-Watson kernel-weighted average. We show that a single attention layer may be designed to exactly implement a functional-gradient step in this setting (for RKHS latent functions), extending prior work for the special case of real-valued $Y$ and Gaussian $p(Y|f(x))$. This is further generalized to softmax attention and to underlying $f(x)$ outside an RKHS. Though our results hold in a general setting, we focus on categorical $Y$ with $p(Y|f(x))$ modeled as a generalized linear model (corresponding specifically to softmax probability). Multi-layered extensions are developed for this case, and through extensive experimentation we demonstrate that for categorical $Y$ a single-layer model is often highly effective for such in-context learning. We also demonstrate these ideas on real-world data, considering in-context classification of ImageNet data, showing the broad applicability of our theory beyond the commonly studied setting of synthetic regression data.
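As an illustration of the kernel-averaging view mentioned in the abstract, the following minimal sketch (ours, not code from the submission) checks numerically that a single softmax-attention read-out of the query over the context $(x_i,y_i)$ coincides with the Nadaraya-Watson kernel-weighted average of the $y_i$ under the exponential kernel $K(x,x')=\exp(x^\top W x')$. The matrix `W` and all variable names are hypothetical choices made for this example, not the paper's exact construction.

```python
# Illustrative sketch (assumptions: W is a generic query-key product matrix,
# data are synthetic): softmax attention over the context equals the
# Nadaraya-Watson kernel-weighted average with kernel K(x, x') = exp(x^T W x').
import numpy as np

rng = np.random.default_rng(0)
d, N = 4, 32

X = rng.normal(size=(N, d))       # context inputs x_1, ..., x_N
y = rng.normal(size=(N, 1))       # context targets y_1, ..., y_N
x_query = rng.normal(size=(d,))   # query x_{N+1}
W = np.eye(d)                     # query-key matrix (identity for illustration)

# Softmax attention of the query over the context keys, reading out the y_i.
scores = X @ W @ x_query                  # attention logits, shape (N,)
attn = np.exp(scores - scores.max())      # stabilized exponentials
attn /= attn.sum()                        # softmax weights
y_attention = attn @ y                    # attention prediction for x_{N+1}

# Nadaraya-Watson estimate with kernel K(x, x') = exp(x^T W x').
K = np.exp(scores - scores.max())
y_nw = (K @ y) / K.sum()

assert np.allclose(y_attention, y_nw)     # the two estimates coincide exactly
```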
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8112