Keywords: Deep Learning Theory, Covariance SDE, Attention Mechanism, Infinite-Depth-and-Width, Scaling Limit
TL;DR: We study the proportional infinite-depth-and-width limit of Transformers at initialization in order to devise a modified attention mechanism that avoids known degeneracy issues.
Abstract: In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
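To make the described modification concrete, below is a minimal sketch of a "shaped" attention layer as summarized in the abstract: the Softmax logits are divided by a width-dependent temperature, and the Softmax output is centered at the identity so that the layer acts as a small perturbation of a skip connection at initialization. The function name, the exact centering constant (uniform 1/seq_len), and the choice of temperature in the usage example are illustrative assumptions, not the paper's precise parameterization.

```python
import math
import torch
import torch.nn.functional as F

def shaped_attention(q, k, v, tau):
    """Sketch of shaped attention (assumed parameterization).

    q, k, v: tensors of shape (seq_len, width).
    tau: width-dependent temperature scaling the Softmax logits.
    """
    seq_len, _ = q.shape
    logits = (q @ k.T) / tau                    # temperature-scaled logits
    attn = F.softmax(logits, dim=-1)            # standard row-stochastic Softmax
    # Center the Softmax output: subtract its (approximately uniform) mean at
    # initialization, then add the identity so the layer starts near a skip connection.
    shaped = torch.eye(seq_len) + attn - 1.0 / seq_len
    return shaped @ v

# Illustrative usage: width-dependent temperature chosen here as sqrt(width),
# purely as an assumption for this sketch.
width, seq_len = 64, 16
q = k = v = torch.randn(seq_len, width)
out = shaped_attention(q, k, v, tau=math.sqrt(width))
```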
Submission Number: 11329