Keywords: network architectures, transformers, effective field theory
TL;DR: A standard model for architectures including Transformers
Abstract: Are representations in Transformers provably optimal? We present an axiomatic theory of the Transformer architecture. First, we show that a complex-valued Transformer with linear attention and linear feed-forward residual blocks is uniquely determined by a potential field governed by leading free and interaction terms. As practical extensions of the theory, we characterize ReLU, conic, and gated MLPs, as well as softmax and sparse attention, via axiomatic constructions. The implications include a (non-exhaustive) unification of existing Transformer variants within a single formalism, and a principled foundation for future architecture search.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 122