Abstract: We study in-context learning (ICL) with Transformers for categorical outputs $y_i$, a setting that has received far less attention than real-valued $y_i$. While attention-only Transformers can, in principle, perform functional gradient descent (GD) inference for real-valued outputs, we show that categorical $y_i$ introduce a nonlinear interlayer computation. The MLP layers interleaved with attention in the standard Transformer are a natural architectural component for approximating this computation, which gives MLPs a concrete role that is absent in the real-valued setting. We characterize conditions under which attention-only models can nevertheless succeed: at early layers, when all positions share similar representations, and the softmax operates in its approximately linear regime. Our theory predicts that attention-only models should degrade at greater depth and under distribution mismatch between training and testing data. We confirm these predictions empirically across synthetic data, real-world image classification with domain shift, and surgical action triplet recognition. Guided by the analysis, we propose a sparse Transformer parameterization linked to functional GD that reduces trainable parameters by roughly $50\times$ relative to an unconstrained Transformer, with minimal performance degradation. The resulting efficiency is particularly valuable in data-limited applications, as we demonstrate through ICL analysis of human surgical procedures.
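As a rough illustration of the contrast drawn in the abstract, the two functional GD updates can be written side by side. This is a minimal sketch under assumed notation, not the paper's own derivation: $f_\ell$ (the prediction function after layer $\ell$), the kernel $\kappa$, the step size $\eta$, the one-hot vector $e_{y_i}$, and the class count $C$ are all symbols introduced here for illustration. For real-valued $y_i$ with squared loss, the update is linear in the residuals:

$$
f_{\ell+1}(x) = f_\ell(x) + \eta \sum_{i} \bigl(y_i - f_\ell(x_i)\bigr)\, \kappa(x_i, x).
$$

For categorical $y_i$ with cross-entropy loss on logits $f_\ell(x_i) \in \mathbb{R}^C$, the residual involves a softmax of the current prediction, which must be recomputed between steps:

$$
f_{\ell+1}(x) = f_\ell(x) + \eta \sum_{i} \bigl(e_{y_i} - \mathrm{softmax}(f_\ell(x_i))\bigr)\, \kappa(x_i, x).
$$

When the logits are small, a first-order expansion around zero gives

$$
\mathrm{softmax}(f) \approx \tfrac{1}{C}\mathbf{1} + \tfrac{1}{C}\Bigl(I - \tfrac{1}{C}\mathbf{1}\mathbf{1}^\top\Bigr) f,
$$

which is one way to read the "approximately linear regime" mentioned above: in that regime the categorical update is again close to linear in $f_\ell$, so no separate nonlinear step is needed.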
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Alexander_S_Ecker1
Submission Number: 7518