Abstract: The growing size of neural language models has led to increased attention to model compression. The two predominant approaches are pruning, which gradually removes weights from a pre-trained model, and distillation, which trains a smaller, compact model to match a larger one. Pruning methods can significantly reduce model size but rarely achieve speedups as large as those from distillation. Distillation methods, on the other hand, require large amounts of unlabeled data and are expensive to train. In this work, we aim to close this gap and propose a structured pruning method, MixedPruning, which matches its distillation counterparts in both latency and accuracy while incurring only 5% of the training cost and using no unlabeled data. Our key insight is to jointly prune coarse-grained (e.g., layers) and fine-grained (e.g., heads and hidden units) modules, controlling the pruning decision for each parameter with masks of different granularity. This pruning strategy eases optimization and delivers highly competitive, parallelizable subnetworks that have not been demonstrated before. We also propose a novel layerwise distillation approach to further guide pruning. We evaluate MixedPruning extensively on the SQuAD and GLUE datasets and demonstrate its effectiveness and efficiency over state-of-the-art pruning and distillation methods.
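The abstract's central idea is that each parameter is governed by the product of masks at several granularities (whole layers, attention heads, hidden units). The sketch below is a minimal, illustrative PyTorch module, not the paper's actual implementation: the class name, mask parameters, and shapes are assumptions made for clarity, and the masks are plain tensors rather than the learned (e.g., L0-relaxed) variables a real structured-pruning method would use.

```python
import torch
import torch.nn as nn

class MaskedSelfAttention(nn.Module):
    """Illustrative sub-layer with pruning masks of different granularity:
    a scalar mask gates the whole sub-layer (coarse), a per-head mask gates
    attention heads, and a per-dimension mask gates hidden units (fine)."""

    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.out = nn.Linear(hidden_size, hidden_size)
        # Masks at different granularity; in an actual pruning method these
        # would be learned, here they are fixed to ones purely for illustration.
        self.layer_mask = nn.Parameter(torch.ones(1))             # coarse: whole sub-layer
        self.head_mask = nn.Parameter(torch.ones(num_heads))      # fine: attention heads
        self.hidden_mask = nn.Parameter(torch.ones(hidden_size))  # fine: hidden units

    def forward(self, x):                     # x: (batch, seq, hidden)
        b, s, h = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        ctx = attn @ v                        # (batch, heads, seq, head_dim)
        # Head-level mask: a zeroed head contributes nothing and can be removed.
        ctx = ctx * self.head_mask.view(1, self.num_heads, 1, 1)
        ctx = ctx.transpose(1, 2).reshape(b, s, h)
        # Hidden-unit and layer-level masks compose multiplicatively, so a
        # parameter survives only if every mask covering it is non-zero.
        return self.out(ctx) * self.hidden_mask * self.layer_mask


x = torch.randn(2, 16, 768)
print(MaskedSelfAttention()(x).shape)         # torch.Size([2, 16, 768])
```

Because the masks multiply, zeroing the coarse layer mask prunes the entire sub-layer at once, while the finer masks carve out heads and hidden units inside layers that remain, which is how a single mechanism can cover both granularities.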
Paper Type: long