The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

Published: 21 Sept 2023, Last Modified: 02 Nov 2023, NeurIPS 2023 poster
Keywords: Deep Learning Theory, Covariance SDE, Attention Mechanism, Infinite-Depth-and-Width, Scaling Limit
TL;DR: We study the proportional infinite-depth-and-width limit of Transformers at initialization in order to devise a modified attention mechanism that avoids known degeneracy issues.
Abstract: In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network’s trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer’s attention mechanism is modified by centering the Softmax output at the identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
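To make the described modification concrete, here is a minimal sketch of a single attention layer with the two changes the abstract names: Softmax logits scaled by a width-dependent temperature, and the Softmax output centered at the identity, wrapped in a residual connection. The function and parameter names (`shaped_attention`, `gamma`, `tau0`, `alpha`, `beta`) and the exact temperature scaling are illustrative assumptions for this sketch, not the paper's notation or official code.

```python
import torch
import torch.nn.functional as F

def shaped_attention(x, w_q, w_k, w_v, gamma=1.0, tau0=1.0):
    """Sketch of a 'shaped' attention layer (assumed parameterization):
    - logits are divided by a width-dependent temperature, and
    - the Softmax output is re-centered so the attention matrix stays
      close to the identity at large width."""
    n_seq, width = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Width-dependent temperature: logits shrink as the width grows
    # (the precise scaling with width is specified in the paper).
    tau = tau0 * width
    attn = F.softmax((q @ k.T) / tau, dim=-1)

    # Center the Softmax output at the identity: subtract the uniform
    # attention matrix and add I, so the layer is a perturbation of identity.
    uniform = torch.full((n_seq, n_seq), 1.0 / n_seq)
    attn_shaped = torch.eye(n_seq) + gamma * (attn - uniform)
    return attn_shaped @ v

def shaped_block(x, w_q, w_k, w_v, alpha=1.0, beta=0.1):
    """Residual (skip) connection around the shaped attention layer,
    with assumed branch weights alpha and beta."""
    return alpha * x + beta * shaped_attention(x, w_q, w_k, w_v)

# Example usage: sequence length 16, width 64, random Gaussian weights.
x = torch.randn(16, 64)
w_q, w_k, w_v = (torch.randn(64, 64) / 64 ** 0.5 for _ in range(3))
y = shaped_block(x, w_q, w_k, w_v)
```

The intuition behind the design, per the abstract, is that the identity-centered Softmax and the residual weights keep the covariance of the representations non-degenerate as depth and width grow proportionally, which is what the limiting SDE formalizes.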
Submission Number: 11329