Not All Attention Is All You Need

29 Sept 2021 (modified: 13 Feb 2023), ICLR 2022 Conference Withdrawn Submission
Keywords: dropout, meta-learning, pre-trained language model, self-attention
Abstract: Dropout has proven an effective means of alleviating over-fitting in neural models by forcibly blocking less helpful connections. However, standard dropout is applied crudely across all neural structures with the same dropout pattern, fixed once and for all, to avoid the huge search space of tuning every individual structure. We therefore propose $AttendOut$, a meta-learning approach that performs smart, unit-specific dropout for attention models. The proposed smart dropout is nearly parameter-free and yields even stronger performance with a faster tuning cycle, even when applied to state-of-the-art pre-trained language models. Finally, we verify the universality of our approach on extensive downstream tasks in both the pre-training and fine-tuning stages.
Supplementary Material: zip
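
The abstract does not detail how $AttendOut$ decides its dropout pattern, but the general idea of unit-specific dropout over attention can be illustrated with a minimal PyTorch sketch. Here the "unit" is assumed to be an attention head, each with its own dropout rate applied to its attention weights; the module name, the per-head rate tensor, and the fixed rates are hypothetical stand-ins for illustration only, not the authors' AttendOut mechanism, which meta-learns the dropout decisions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnitSpecificAttentionDropout(nn.Module):
    """Illustrative multi-head self-attention where each head gets its own
    dropout rate on its attention map, instead of one global rate.
    (Hypothetical sketch, not the AttendOut method from the paper.)"""

    def __init__(self, embed_dim: int, num_heads: int, base_drop: float = 0.1):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)
        # Hypothetical per-head dropout rates, fixed at base_drop here;
        # a meta-learner could instead predict these per unit and per task.
        self.head_drop_rates = nn.Parameter(
            torch.full((num_heads,), base_drop), requires_grad=False
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):
            # (B, T, C) -> (B, heads, T, head_dim)
            return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)

        # Scaled dot-product attention weights per head: (B, heads, T, T)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)

        if self.training:
            # Apply a different dropout rate to each head's attention map.
            dropped = []
            for h in range(self.num_heads):
                p = float(self.head_drop_rates[h])
                dropped.append(F.dropout(attn[:, h], p=p, training=True))
            attn = torch.stack(dropped, dim=1)

        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return self.out(out)
```

In the setting the abstract describes, one would expect the per-unit rates (or drop/keep decisions) to be set by a meta-learner per task rather than kept fixed, which is the part this sketch deliberately leaves out.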