RealFormer: Transformer Likes Residual Attention
Abstract: Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple and generic technique to create Residual Attention Layer Transformer networks that significantly outperform the canonical Transformer and its variants (BERT, ETC, etc.) on a wide spectrum of tasks, including Masked Language Modeling, GLUE, SQuAD, Neural Machine Translation, WikiHop, HotpotQA, Natural Questions, and OpenKP. We also observe empirically that RealFormer stabilizes training and leads to models with sparser attention. Source code and pre-trained checkpoints for RealFormer can be found at https://github.com/google-research/google-research/tree/master/realformer.
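
The abstract only names the technique, so as a rough illustration the sketch below assumes the residual connection runs over the raw (pre-softmax) attention scores: each layer adds the previous layer's score matrix to its own logits before the softmax and passes the summed scores onward. This is a minimal single-head NumPy sketch under that assumption; the function name `residual_attention` and the exact placement of the residual term are illustrative, not the paper's reference implementation (see the linked repository for that).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention(q, k, v, prev_scores=None):
    """Single-head scaled dot-product attention with a residual edge
    over the pre-softmax attention scores (illustrative sketch, not
    the paper's reference code).

    q, k, v: [seq_len, d_head] arrays.
    prev_scores: [seq_len, seq_len] score matrix carried over from the
        previous layer, or None for the first layer.
    Returns (output, scores); `scores` is fed to the next layer.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)      # this layer's raw attention logits
    if prev_scores is not None:
        scores = scores + prev_scores  # residual attention: add previous layer's logits
    out = softmax(scores, axis=-1) @ v # standard softmax-weighted sum of values
    return out, scores
```

In a full stack built this way, each attention layer would return its `scores` matrix alongside the hidden states and hand it to the next layer, so the residual edge chains through the entire network.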