Not All Attention Is All You Need

29 Sept 2021 (modified: 13 Feb 2023), ICLR 2022 Conference Withdrawn Submission
Keywords: dropout, meta-learning, pre-trained language model, self-attention
Abstract: Dropout has proven an effective means of alleviating over-fitting in neural models by forcibly blocking less helpful connections. However, standard dropout is applied crudely across all neural structures with the same dropout pattern, fixed once and for all, to avoid the huge search space of tuning every individual structure. We therefore propose $AttendOut$, a meta-learning approach that performs smart, unit-specific dropout for attention models. The proposed smart dropout is nearly parameter-free and yields even stronger performance with a faster tuning cycle, even when applied to state-of-the-art pre-trained language models. Finally, we verify the universality of our approach on extensive downstream tasks in both the pre-training and fine-tuning stages.
Supplementary Material: zip
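
The abstract does not detail how $AttendOut$ decides its dropout pattern, but the general idea of unit-specific dropout over attention can be illustrated with a minimal PyTorch sketch. Here the "unit" is assumed to be an attention head, each with its own dropout rate applied to its attention weights; the module name, the per-head rate tensor, and the fixed rates are hypothetical stand-ins for illustration only, not the authors' AttendOut mechanism, which meta-learns the dropout decisions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnitSpecificAttentionDropout(nn.Module):
    """Illustrative multi-head self-attention where each head gets its own
    dropout rate on its attention map, instead of one global rate.
    (Hypothetical sketch, not the AttendOut method from the paper.)"""

    def __init__(self, embed_dim: int, num_heads: int, base_drop: float = 0.1):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)
        # Hypothetical per-head dropout rates, fixed at base_drop here;
        # a meta-learner could instead predict these per unit and per task.
        self.head_drop_rates = nn.Parameter(
            torch.full((num_heads,), base_drop), requires_grad=False
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):
            # (B, T, C) -> (B, heads, T, head_dim)
            return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)

        # Scaled dot-product attention weights per head: (B, heads, T, T)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)

        if self.training:
            # Apply a different dropout rate to each head's attention map.
            dropped = []
            for h in range(self.num_heads):
                p = float(self.head_drop_rates[h])
                dropped.append(F.dropout(attn[:, h], p=p, training=True))
            attn = torch.stack(dropped, dim=1)

        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return self.out(out)
```

In the setting the abstract describes, one would expect the per-unit rates (or drop/keep decisions) to be set by a meta-learner per task rather than kept fixed, which is the part this sketch deliberately leaves out.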