Keywords: Compression, Pruning, Inference-Awareness
TL;DR: We present a novel structured pruning approach which is inference-aware, and achieves state-of-the-art accuracy-vs-speedup performance across the board.
Abstract: In this paper, we propose a novel structured compression approach for LLMs, called ZipLM, which achieves state-of-the-art accuracy-vs-speedup, while matching a set of desired target runtime speedups in any given inference environment. Specifically, given a model, a dataset, an inference environment, as well as a set of speedup targets, ZipLM iteratively identifies and removes components with the worst loss-runtime trade-off. Unlike prior methods that specialize in either the *post-training/one-shot* or the *gradual compression* setting, and only for specific families of models such as BERT (*encoder*) or GPT (*decoder*), ZipLM produces state-of-the-art compressed models across all these settings. Furthermore, ZipLM achieves superior results for a fraction of the computational cost relative to prior distillation and pruning techniques, making it a cost-effective approach for generating an entire family of smaller, faster, and highly accurate models, guaranteed to meet the desired inference specifications. In particular, ZipLM outperforms all prior BERT-base distillation and pruning techniques, such as CoFi, MiniLM, and TinyBERT. Of note is that on analyzed GLUE tasks, ZipLM compresses BERT-base up to 15x faster model while recovering $\geq 95$% accuracy. The resulting models have encoder size reduced from 85M to only 3M parameters, and on average $\leq 10$ attention heads compared to 144 heads in the uncompressed model. Moreover, ZipLM matches the performance of the heavily optimized MobileBERT model, obtained via extensive architecture search, by simply pruning the baseline BERT-large architecture. When compressing GPT2, ZipLM outperforms DistilGPT2 while being 60\% smaller and 30\% faster.
Submission Number: 39
Loading