Structural Pruning of Transformer with Gradient-Aided Regularisation for Faster Inference

Anonymous

16 Dec 2022 (modified: 05 May 2023) · ACL ARR 2022 December Blind Submission · Readers: Everyone
Abstract: Structural pruning produces smaller dense networks, so standard dense matrix multiplication routines remain efficient with minimal software change. Group lasso successfully sparsifies transformer architectures at a coarse, structural level. However, the regularisation does not distinguish between layers: it applies penalties with the same force regardless of whether specific parameters are crucial to performance and training flow. Forcing a model to remove such parameters, especially structurally, degrades quality. We propose a gradient-aided regularisation scheme that scales each layer's penalty based on its gradient norm during training. Experiments on neural machine translation, pruning entire attention heads and feedforward connections, show that our method pushes the Pareto frontier beyond the current state of the art. On the Estonian→English task, removing two-thirds of the attention and feedforward parameters from a 12-1.base architecture makes inference $1.7\times$ faster at a cost of $0.3$ BLEU. In the WMT2022 Efficiency Shared Task, our pruned models translate $1.2$--$1.5\times$ faster than the baseline submissions while scoring slightly better in COMET. The pruned models land on the Pareto frontier, offering the best quality-speed trade-off in the CPU throughput task.
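
The abstract does not fix the exact form of the penalty scaling, so the following is a minimal PyTorch-style sketch of a gradient-aided group-lasso regulariser, assuming the per-layer scale is taken inversely proportional to that layer's detached gradient norm; the names `gradient_aided_penalty`, `grouped_params`, and `base_lambda` are illustrative, not from the paper.

```python
import torch

def group_lasso_penalty(weight: torch.Tensor, group_dim: int = 0) -> torch.Tensor:
    """Group lasso over structural groups: sum of L2 norms of the rows or
    columns of a weight matrix, e.g. one group per attention head slice or
    per feedforward unit."""
    return weight.norm(p=2, dim=group_dim).sum()

def gradient_aided_penalty(grouped_params, base_lambda: float = 1e-4, eps: float = 1e-8):
    """grouped_params: iterable of (weight, group_dim) pairs, one per prunable layer.

    Assumption: each layer's group-lasso term is divided by the norm of that
    layer's current task-loss gradient, so layers that still receive a strong
    training signal are penalised less aggressively than layers whose
    gradients have flattened out.
    """
    total = 0.0
    for weight, group_dim in grouped_params:
        if weight.grad is not None:
            grad_norm = weight.grad.detach().norm()   # gradient from the task-loss backward pass
        else:
            grad_norm = 1.0                           # no gradient yet: leave the penalty unscaled
        layer_lambda = base_lambda / (grad_norm + eps)
        total = total + layer_lambda * group_lasso_penalty(weight, group_dim)
    return total
```

In this sketch, a training step would compute the task loss, call `loss.backward()` to populate the gradients, evaluate `gradient_aided_penalty` on the prunable weight matrices, backpropagate that term as well, and only then step the optimiser; groups whose norms are driven near zero can then be removed structurally, as is usual for group-lasso pruning.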
Paper Type: long
Research Area: Efficient Methods for NLP
