Structured Pruning Learns Compact and Accurate Models

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: The growing size of neural language models has drawn increasing attention to model compression. Pruning methods start from a large model and gradually remove model weights; they can substantially reduce the model size but rarely yield significant runtime speedups. Distillation methods, on the other hand, start from a shallower, compact model and can obtain large speedups; however, they are costly to train on large amounts of unlabeled data. In this work, we show that structured pruning can match its distillation counterparts in both latency ($>$10$\times$) and accuracy ($>$92\%) and produce highly compact and efficient subnetworks. Unlike distillation, our task-specific pruning approach, {\ours}, neither needs to pre-specify the model architecture nor relies on unlabeled data. Our solution is to jointly prune layers and sub-modules such as attention heads and hidden units in Transformer models through $l_0$ regularization, while ensuring that the resulting model remains parallelizable. We also propose a layerwise distillation approach to further guide pruning. Finally, the pruned structures reveal interesting patterns: for example, more than 70\% of feed-forward and 50\% of self-attention layers can be easily pruned, while the first and last 1-2 layers are likely to remain in highly compressed models.
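The abstract describes pruning sub-modules through $l_0$ regularization but does not spell out the parametrization. The sketch below uses the hard concrete relaxation (Louizos et al., 2018), a standard way to make an $l_0$ penalty differentiable with learned gates over prunable units; the class name, hyperparameters, and per-head granularity are illustrative assumptions, not the authors' released implementation.

```python
import math
import torch
import torch.nn as nn


class HardConcreteGate(nn.Module):
    """Stochastic gates z in [0, 1], one per prunable unit (e.g. an attention
    head, a hidden dimension, or a whole layer), trained with an expected-L0
    sparsity penalty via the hard concrete relaxation."""

    def __init__(self, num_units: int, beta: float = 2 / 3,
                 gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(num_units))  # gate logits
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self) -> torch.Tensor:
        if self.training:
            # Reparametrized sample: uniform noise keeps the gates stochastic
            # while remaining differentiable w.r.t. log_alpha.
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)  # deterministic at inference
        s = s * (self.zeta - self.gamma) + self.gamma   # stretch beyond [0, 1]
        return s.clamp(0.0, 1.0)                        # hard clip -> exact zeros

    def expected_l0(self) -> torch.Tensor:
        # Probability that each gate is non-zero; the sum is the differentiable
        # sparsity term added (with a weight) to the task loss.
        return torch.sigmoid(
            self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()


# Illustrative use: gate the 12 attention heads of one Transformer layer.
head_gates = HardConcreteGate(num_units=12)
z = head_gates()                          # shape (12,); multiply into head outputs
sparsity_loss = head_gates.expected_l0()  # add lambda * sparsity_loss to task loss
```

In such a setup, gates driven to exactly zero mark heads, hidden units, or layers that can be removed from the final architecture, which is what allows structured pruning to translate sparsity into actual latency gains.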