Big Bird: Transformers for Longer Sequences
Abstract: Transformer-based models, such as BERT, have been among the most successful
deep learning models for NLP. Unfortunately, one of their core limitations is the
quadratic dependency (mainly in terms of memory) on the sequence length due to
their full attention mechanism. To remedy this, we propose BIGBIRD, a sparse
attention mechanism that reduces this quadratic dependency to linear. We show
that BIGBIRD is a universal approximator of sequence functions and is Turing
complete, thereby preserving these properties of the quadratic, full attention model.
Along the way, our theoretical analysis reveals some of the benefits of having
O(1) global tokens (such as CLS) that attend to the entire sequence as part of the
sparse attention mechanism. The proposed sparse attention can handle sequences
of length up to 8× what was previously possible using similar hardware. As
a consequence of the capability to handle longer context, BIGBIRD drastically
improves performance on various NLP tasks such as question answering and
summarization. We also propose novel applications to genomics data.
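For illustration, below is a minimal NumPy sketch of the kind of sparse attention pattern the abstract describes: a constant number of global tokens that attend to (and are attended by) every position, combined with local and random connections so that each query touches only O(1) keys. The function name, window size, and token counts are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def sparse_attention_mask(seq_len, num_global=2, window=3, num_random=2, seed=0):
    """Build a BigBird-style sparse attention mask (1 = attend, 0 = masked).

    Hypothetical sketch: combines global tokens, a sliding local window,
    and a few random connections per query. Each row has O(1) nonzeros,
    so memory grows linearly with seq_len rather than quadratically.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=np.int8)

    # Global tokens (e.g. CLS): attend everywhere and are visible to all positions.
    mask[:num_global, :] = 1
    mask[:, :num_global] = 1

    # Sliding window: each token attends to its local neighbourhood.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = 1

    # Random connections: a handful of extra keys per query.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = 1

    return mask

if __name__ == "__main__":
    m = sparse_attention_mask(seq_len=16)
    # Each row attends to a small, roughly constant number of keys, not all 16.
    print("attended keys per query:", m.sum(axis=1))
```

In a full model, this mask would gate the softmax scores of a standard attention layer; the paper's contribution is showing that such a sparse pattern retains the expressivity (universal approximation, Turing completeness) of full attention while scaling to much longer sequences.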