Universal Approximation with Softmax Attention

18 Sept 2025 (modified: 27 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: universal approximation, attention, expressiveness, in-context learning
TL;DR: We prove that one- and two-layer attention with linear transformations are universal approximators for continuous sequence-to-sequence functions.
Abstract: We prove that with linear transformations, both (i) two-layer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains. Our main technique is a new interpolation-based method for analyzing attention’s internal mechanism. This leads to our key insight: self-attention is able to approximate a generalized version of ReLU to arbitrary precision, and hence subsumes many known universal approximators. Building on these results, we show that two-layer multi-head attention, or even one-layer multi-head attention followed by a softmax function, suffices as a sequence-to-sequence universal approximator. In contrast, prior works rely on feed-forward networks to establish universal approximation in Transformers. Furthermore, we extend our techniques to show that (softmax-)attention-only layers are capable of approximating gradient descent in-context. We believe these techniques hold independent interest.
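To make the architecture in claim (ii) concrete, here is a minimal sketch of a one-layer multi-head softmax self-attention block with a linear input map, a linear output map, and a final row-wise softmax. This is not the authors' construction or proof; all parameter names, shapes, and the random initialization below are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def one_layer_attention(X, params):
    """X: (seq_len, d_in) input sequence; returns (seq_len, d_out) row-wise probabilities."""
    W_in, heads, W_out = params["W_in"], params["heads"], params["W_out"]
    H = X @ W_in                                    # linear transformation of the input
    head_outputs = []
    for (W_q, W_k, W_v) in heads:                   # multi-head softmax self-attention
        Q, K, V = H @ W_q, H @ W_k, H @ W_v
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
        head_outputs.append(A @ V)
    Z = np.concatenate(head_outputs, axis=-1)       # concatenate heads
    return softmax(Z @ W_out, axis=-1)              # final softmax, as in claim (ii)

# Purely illustrative usage with random parameters.
rng = np.random.default_rng(0)
d_in, d_model, d_head, d_out, n_heads, seq_len = 4, 8, 4, 3, 2, 5
params = {
    "W_in": rng.normal(size=(d_in, d_model)),
    "heads": [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
              for _ in range(n_heads)],
    "W_out": rng.normal(size=(n_heads * d_head, d_out)),
}
Y = one_layer_attention(rng.normal(size=(seq_len, d_in)), params)
print(Y.shape)  # (5, 3); each row sums to 1
```

The paper's claim concerns the expressive capacity of this class of maps, i.e., that suitable choices of the linear transformations and attention weights can approximate any continuous sequence-to-sequence function on a compact domain to arbitrary precision.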
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12307