Keywords: In-Context Learning, Attention Mechanism, In-Context Gradient Descent, Transformer, Universal Approximation
TL;DR: We prove that a minimal Transformer with frozen weights emulates a broad class of algorithms by in-context prompting
Abstract: We prove that a minimal Transformer with frozen weights emulates a broad class of algorithms by in-context prompting.
We formalize two modes of in-context algorithm emulation.
In the *task-specific mode*, for any continuously differentiable function $f: \mathbb{R} \to \mathbb{R}$, we construct a single-head softmax attention layer whose forward pass approximates functions of the form $f(w^\top x - y)$ to arbitrary precision.
This general template subsumes many popular machine learning algorithms (e.g., gradient descent, linear regression, ridge regression).
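As a rough illustration of this mode (a minimal NumPy sketch under our own simplifying assumptions, not the paper's exact construction), a single softmax attention pass can approximate $f(w^\top x - y)$ when keys are placed on a grid, values store $f$ on that grid, and the query encodes $z = w^\top x - y$ with a large inverse temperature, so the attention weights form a sharp kernel average around $z$:

```python
import numpy as np

def softmax(s):
    s = s - s.max()
    return np.exp(s) / np.exp(s).sum()

def attention_emulate(f, w, x, y, beta=200.0, grid=None):
    """Approximate f(w^T x - y) with one softmax attention pass (illustrative only)."""
    z = w @ x - y                                  # scalar the head should evaluate f at
    if grid is None:
        grid = np.linspace(-3.0, 3.0, 2001)        # key positions covering the range of z
    # Query/key embeddings chosen so that q . k_i = beta*(z*g_i - g_i^2/2);
    # up to a constant in i, the softmax weights are proportional to
    # exp(-beta/2 * (z - g_i)^2), concentrating on grid points nearest z.
    q = np.array([beta * z, beta])
    K = np.stack([grid, -0.5 * grid**2], axis=1)   # (num_keys, 2)
    V = f(grid)                                    # values store f on the grid
    attn = softmax(K @ q)
    return attn @ V                                # attention output ~ f(z)

# Example: f(r) = r recovers the least-squares residual w^T x - y itself.
w, x, y = np.array([0.5, -1.0]), np.array([1.0, 2.0]), 0.3
print(attention_emulate(lambda r: r, w, x, y))                 # ~ -1.8
print(attention_emulate(np.tanh, w, x, y), np.tanh(w @ x - y)) # close agreement
```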
In the *prompt-programmable mode*, we prove universality:
a single fixed-weight two-layer softmax attention module emulates every algorithm in the task-specific class (i.e., any algorithm implementable by a single softmax attention layer) via prompting alone.
Our key idea is to construct prompts that encode an algorithm’s parameters into token representations, creating sharp dot-product gaps that force the softmax attention to follow the intended computation.
This construction requires no feed-forward layers and no parameter updates.
All adaptation happens through the prompt alone.
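The prompt-programming mechanism can be caricatured as follows (a hedged toy sketch, not the paper's construction): prompt tokens carry mutually orthogonal, sharply scaled tags alongside a payload encoding an algorithm's parameters; the query repeats the tag of the desired algorithm, and the resulting dot-product gap makes the frozen softmax attention retrieve only the matching payload.

```python
import numpy as np

def softmax(s):
    s = s - s.max()
    return np.exp(s) / np.exp(s).sum()

def frozen_attention(query, prompt_keys, prompt_values):
    """One fixed-weight softmax attention read over the prompt."""
    return softmax(prompt_keys @ query) @ prompt_values

def make_prompt(payloads, scale=50.0):
    """Encode candidate 'algorithms' (here: parameter vectors) as prompt tokens."""
    n = len(payloads)
    keys = np.eye(n) * scale           # sharp, mutually orthogonal tags as keys
    values = np.stack(payloads)        # values carry each algorithm's parameters
    return keys, values

# Two "algorithms" encoded in the prompt: two different linear predictors w.
payloads = [np.array([1.0, 0.0]), np.array([0.0, -2.0])]
keys, values = make_prompt(payloads)

x = np.array([0.7, 0.4])
for alg_id in range(2):
    query = np.eye(2)[alg_id] * 50.0               # prompt-side selector for algorithm alg_id
    w = frozen_attention(query, keys, values)      # retrieved parameters ~ payloads[alg_id]
    print(alg_id, w @ x)                           # prediction under the selected algorithm
```

Swapping which algorithm runs requires changing only the prompt tokens; the attention weights themselves stay frozen, which is the sense of prompt programmability described above.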
Numerical results corroborate our theory.
These findings forge a direct link between in-context learning and algorithmic emulation, and offer a simple mechanism for large Transformers to serve as prompt-programmable interpreters of algorithms.
They illuminate how GPT-style foundation models may swap algorithms via prompts alone, and establish a form of algorithmic universality in modern Transformer models.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 558