Abstract: Here we investigate the intrinsic and extrinsic structure of the attention heads in transformers. In particular, we provide theoretical evidence that the self-attention mechanism is invariant to the softmax activation, by appealing to paradifferential calculus. This theory is accompanied by computational evidence, which relies on organizing the attention heads so that they exhibit mixed Hölder regularity. Furthermore, we present a methodology for examining network structure that constructs a hierarchical organization of the network with respect to the query, key, and head axes of network 3-tensors using partition trees. Such an organization is consequential, as it allows one to profitably execute common signal processing exercises on a geometry where the network 3-tensors exhibit regularity. We exemplify this qualitatively and quantitatively by visualizing the hierarchical organization of the tree of attention heads along with diffusion map embeddings, and by investigating network sparsity using the ℓ¹ entropy of the attention heads with respect to the bi-Haar basis on the space of queries and keys. Instances of network sparsity are detected when this methodology is deployed across multiple models and datasets. The ramifications of these findings are twofold: (1) a subsequent step in interpretability analysis is theoretically admitted and can be exploited empirically for downstream interpretability tasks, and (2) the network 3-tensor organization can be used for model pruning by determining which attention heads in the network are least involved in data processing, by virtue of the network sparsity.
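As a concrete illustration of one ingredient mentioned above, the following is a minimal, hypothetical sketch (not the paper's implementation) of embedding the attention heads of a single layer with a diffusion map. It assumes the heads are supplied as an (n_heads, n_queries, n_keys) NumPy array; the function name `diffusion_map`, the kernel bandwidth heuristic, and the example dimensions are illustrative assumptions.

```python
import numpy as np


def diffusion_map(heads: np.ndarray, epsilon: float | None = None, n_components: int = 2) -> np.ndarray:
    """Embed attention heads via a diffusion map.

    heads: array of shape (n_heads, n_queries, n_keys); each head is flattened
    and treated as a point in R^{n_queries * n_keys}.
    epsilon: Gaussian kernel bandwidth; defaults to the median pairwise
    squared distance (a common heuristic, assumed here for convenience).
    """
    X = heads.reshape(heads.shape[0], -1)
    # Pairwise squared Euclidean distances between flattened heads.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    if epsilon is None:
        epsilon = np.median(sq_dists)
    # Gaussian affinity kernel.
    K = np.exp(-sq_dists / epsilon)
    # Row-normalize to obtain a Markov transition matrix.
    P = K / K.sum(axis=1, keepdims=True)
    # Eigendecomposition; the top non-trivial eigenvectors give the embedding.
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)
    eigvals, eigvecs = eigvals[order].real, eigvecs[:, order].real
    # Skip the trivial constant eigenvector (eigenvalue 1) and scale by eigenvalues.
    return eigvecs[:, 1:n_components + 1] * eigvals[1:n_components + 1]


if __name__ == "__main__":
    # Random data standing in for real attention weights: 12 heads over a 64-token context.
    rng = np.random.default_rng(0)
    attn = rng.random((12, 64, 64))
    emb = diffusion_map(attn)   # 2-D diffusion coordinates, one point per head
    print(emb.shape)            # (12, 2)
```

Heads that land close together in the resulting low-dimensional coordinates can then be grouped, which is the kind of organization the partition-tree construction described above would refine hierarchically.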