Transformers are Deep Infinite-Dimensional Non-Mercer Binary Kernel Machines

28 Sept 2020 (modified: 05 May 2023) | ICLR 2021 Conference Blind Submission | Readers: Everyone
Keywords: Transformer models, attention models, kernel methods, reproducing kernel Banach spaces
Abstract: Despite their ubiquity in core AI fields like natural language processing, the mechanics of deep attention-based neural networks like the ``Transformer'' model are not fully understood. In this article, we present a new perspective towards understanding how Transformers work. In particular, we show that the ``"dot-product attention" that is the core of the Transformer's operation can be characterized as a kernel learning method on a pair of Banach spaces. In particular, the Transformer's kernel is characterized as having an infinite feature dimension. Along the way we generalize the standard kernel learning problem to what we term a "binary" kernel learning problem, where data come from two input domains and a response is defined for every cross-domain pair. We prove a new representer theorem for these binary kernel machines with non-Mercer (indefinite, asymmetric) kernels (implying that the functions learned are elements of reproducing kernel Banach spaces rather than Hilbert spaces), and also prove a new universal approximation theorem showing that the Transformer calculation can learn any binary non-Mercer reproducing kernel Banach space pair. We experiment with new kernels in Transformers, and obtain results that suggest the infinite dimensionality of the standard Transformer kernel is partially responsible for its performance. This paper's results provide a new theoretical understanding of a very important but poorly understood model in modern machine learning.
One-sentence Summary: We obtain theoretical results relating Transformer models to generalizations of kernel methods, and empirical results concerning the importance of characteristics of the standard Transformer's kernel.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=p5P5WzwoVZ