- TL;DR: Inference in large Transformers is expensive due to the self-attention in multiple layers. We show a simple decomposition technique can yield a faster, low memory-footprint model that is just as accurate of the original models.
- Abstract: Large pre-trained Transformers such as BERT have been tremendously effective for many NLP tasks. However, inference in these large-capacity models is prohibitively slow and expensive. Transformers are essentially a stack of self-attention layers which encode each input position using the entire input sequence as its context. However, we find that it may not be necessary to apply this expensive sequence-wide self-attention over at all layers. Based on this observation, we propose a decomposition to a pre-trained Transformer that allows the lower layers to process segments of the input independently enabling parallelism and caching. We show that the information loss due to this decomposition can be recovered in the upper layers with auxiliary supervision during fine-tuning. We evaluate de-composition with pre-trained BERT models on five different paired-input tasks in question answering, sentence similarity, and natural language inference. Results show that decomposition enables faster inference (up to 4x), significant memory reduction (up to 70%) while retaining most (up to 99%) of the original performance. We will release the code at<anonymized url>.
- Keywords: Faster Inference, Transformers, Pre-trained Representations