- TL;DR: Easy-to-implement and effective multidimensional Transformer with faster sampling
- Abstract: Self-attention effectively captures large receptive fields with high information bandwidth, but its computational resource requirements grow quadratically with the number of points over which attention is performed. For data arranged as large multidimensional tensors, such as images and videos, the quadratic growth makes self-attention prohibitively expensive. These tensors often have thousands of positions that one wishes to capture and proposed attentional alternatives either limit the resulting receptive field or require custom subroutines. We propose Axial Attention, a simple generalization of self-attention that naturally aligns with the multiple dimensions of the tensors in both the encoding and the decoding settings. The Axial Transformer uses axial self-attention layers and a shift operation to efficiently build large and full receptive fields. Notably the proposed structure of the layers allows for the vast majority of the context to be computed in parallel during decoding without introducing any independence assumptions. This semi-parallel structure goes a long way to making decoding from even a very large Axial Transformer broadly applicable. We demonstrate state-of-the-art results for the Axial Transformer on the ImageNet-32 and ImageNet-64 image benchmarks as well as on the BAIR Robotic Pushing video benchmark. We open source the implementation of Axial Transformers.
- Keywords: self-attention, transformer, images, videos