Keywords: neural fields, continuous representation, attention, softmax, gradient descent
TL;DR: This paper proves that the attention mechanism in Transformers is mathematically equivalent to gradient-based optimization of a neural field, revealing that Transformers have intrinsic properties suited to learning continuous functions.
Abstract: We establish a mathematical connection between neural field optimization and the Transformer attention mechanism. First, we prove that Transformer-based operators that learn a neural field are equivariant to affine transformations of the input coordinates (translations and positive scalings) when relative positional encodings and coordinate normalization are used, extending geometric deep learning to the meta-learning of continuous functions. Second, we show that linear attention exactly computes the negative gradient of a squared-error loss for sinusoidal neural fields, and that softmax attention converges, both empirically and theoretically, to the same identity at rate $O(\tau^{-2})$ as the temperature $\tau$ grows. These results reveal that attention mechanisms carry an implicit geometric encoding that is well suited to learning continuous functions.
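The gradient identity in the abstract can be illustrated with a small numerical sketch. The construction below is an assumption-laden toy rather than the paper's exact setup: the Fourier frequencies, the zero initialization, the unit learning rate, and all function names are illustrative. It checks that one negative-gradient step of a squared-error loss on a linear read-out over sinusoidal features, evaluated at query coordinates, reproduces the output of unnormalized linear attention.

```python
# Minimal sketch (illustrative, not the paper's construction):
# linear attention vs. one gradient step on a squared-error loss
# for a sinusoidal (Fourier-feature) neural field.
import numpy as np

rng = np.random.default_rng(0)

def fourier_features(x, omegas):
    """Sinusoidal encoding gamma(x) = [sin(omega * x), cos(omega * x)]."""
    ang = np.outer(x, omegas)                                   # (n, d/2)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)   # (n, d)

# Context set: coordinate/value pairs sampled from an arbitrary target signal.
omegas = np.array([1.0, 2.0, 4.0, 8.0])          # assumed frequencies
x_ctx = rng.uniform(-np.pi, np.pi, size=32)
y_ctx = np.sin(3.0 * x_ctx)                      # arbitrary target signal
x_qry = rng.uniform(-np.pi, np.pi, size=5)

K = fourier_features(x_ctx, omegas)              # keys    = gamma(x_i)
Q = fourier_features(x_qry, omegas)              # queries = gamma(x_q)
V = y_ctx[:, None]                               # values  = y_i

# (1) Unnormalized linear attention (no softmax): Q K^T V.
attn_out = Q @ K.T @ V

# (2) One gradient step on L(W) = 1/2 * sum_i (W gamma(x_i) - y_i)^2,
#     starting from W = 0 with learning rate 1:
#     W_new = -grad L|_{W=0} = sum_i y_i gamma(x_i)^T,
#     then evaluate the fitted field at the query coordinates.
W_new = V.T @ K                                  # (1, d)
field_out = Q @ W_new.T                          # (n_query, 1)

assert np.allclose(attn_out, field_out)          # identical up to float error
print("max abs difference:", np.max(np.abs(attn_out - field_out)))
```

The match holds because one gradient step from a zero-initialized read-out reduces to the sum $\sum_i y_i\,\gamma(x_i)^\top$, which is exactly the key-value aggregation that linear attention applies to the query features.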
Video Link: https://youtu.be/RkK2jd1pY94
Submission Number: 157