AutoAttention: Automatic Attention Head Selection Through Differentiable Pruning

Anonymous

16 Jan 2022 (modified: 05 May 2023) · ACL ARR 2022 January Blind Submission · Readers: Everyone
Abstract: Multi-head attention is considered a driving force and key component behind state-of-the-art transformer models. However, recent research reveals that each layer contains many redundant heads with duplicated attention patterns. In this work, we propose an automatic pruning strategy that uses differentiable binary gates to remove redundant heads. We relax the binary head-pruning problem into a differentiable optimization by employing Straight-Through Estimators (STEs), so that the model weights and the head-sparse model structure can be jointly learned through back-propagation. In this way, attention heads can be pruned efficiently and effectively. We report experimental results on the General Language Understanding Evaluation (GLUE) benchmark using the BERT model. Our method removes more than 57% of the heads on average with zero or minor accuracy drop on all nine tasks, and even achieves better results than state-of-the-art methods (e.g., Random, HISP, $L_0$ Norm, and SMP). Furthermore, it can prune more than 79% of the heads with only 0.82% accuracy degradation on average. We further illustrate the pruning procedure and parameter changes through attention-head visualizations, showing how the trainable gate parameters determine the head mask and the final attention map.
Paper Type: long
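
The abstract describes learning a hard 0/1 mask per attention head via a straight-through estimator. The following is a minimal sketch of that general idea in PyTorch, not the authors' released code: the `HeadGate` module, its logit parameterization, and the sparsity penalty shown here are illustrative assumptions and may differ from the paper's exact formulation.

```python
import torch
import torch.nn as nn


class HeadGate(nn.Module):
    """Per-head binary gate trained with a straight-through estimator (STE):
    the forward pass uses a hard 0/1 mask, while gradients flow to the
    underlying soft logits as if the mask were the sigmoid itself."""

    def __init__(self, num_heads: int):
        super().__init__()
        # One trainable logit per attention head (hypothetical parameterization).
        self.logits = nn.Parameter(torch.zeros(num_heads))

    def forward(self) -> torch.Tensor:
        probs = torch.sigmoid(self.logits)           # soft gate in (0, 1)
        hard = (probs > 0.5).float()                 # binary head mask
        # Straight-through trick: value equals `hard`, gradient follows `probs`.
        return hard + probs - probs.detach()


# Usage sketch: scale each head's output by its gate so that gated-off heads
# contribute nothing while the mask remains trainable end to end.
gate = HeadGate(num_heads=12)
head_outputs = torch.randn(2, 12, 16, 64)            # (batch, heads, seq, head_dim), dummy data
mask = gate().view(1, -1, 1, 1)
pruned = head_outputs * mask
# An L0-style sparsity term (assumed here) can be added to the task loss
# to push the gates toward pruning more heads.
sparsity_penalty = torch.sigmoid(gate.logits).sum()
```

A sparsity weight on `sparsity_penalty` would trade accuracy against the fraction of heads removed, which is consistent with the 57% vs. 79% pruning regimes reported in the abstract.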