Not All Attention Is All You Need

Hongqiu Wu; hai zhao; Min Zhang

Not All Attention Is All You Need

Hongqiu Wu, hai zhao, Min Zhang

21 May 2021 (modified: 05 May 2023)NeurIPS 2021 SubmittedReaders: Everyone

Keywords: dropout, pre-trained language model, self-attention

Abstract: Beyond the success story of pre-trained language models (PrLMs) in recent natural language processing, they are susceptible to over-fitting due to unusual large model size. To this end, dropout serves as a therapy. However, existing methods like random-based, knowledge-based and search-based dropout are more general but less effective onto self-attention based models, which are broadly chosen as the fundamental architecture of PrLMs. In this paper, we propose a novel dropout method named AttendOut to let self-attention empowered PrLMs capable of more robust task-specific tuning. We demonstrate that state-of-the-art models with elaborate training design may achieve much stronger results. We verify the universality of our approach on extensive natural language processing tasks.

Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.

Supplementary Material: zip

13 Replies

Loading