Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

Lu Yin; You Wu; Zhenyu Zhang; Cheng-Yu Hsieh; Yaqing Wang; Yiling Jia; Gen Li; AJAY KUMAR JAISWAL; Mykola Pechenizkiy; Yi Liang; Michael Bendersky; Zhangyang Wang; Shiwei Liu

Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, AJAY KUMAR JAISWAL, Mykola Pechenizkiy, Yi Liang, Michael Bendersky, Zhangyang Wang, Shiwei Liu

Published: 05 Mar 2024, Last Modified: 12 May 2024PML4LRS PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: large language models (LLMs), pruning, layerwise sparsity, outlier

TL;DR: We propose a novel pruning approach for LLMs, which is able to remove up to 70% of SOTA LLMs while maintaining comparable performance.

Abstract: Large Language Models (LLMs), renowned for their remarkable performance across diverse domains, present a challenge when it comes to practical deployment due to their colossal model size. In response to this challenge, efforts have been directed toward the application of traditional network pruning techniques to LLMs, uncovering a massive number of parameters that can be pruned in one-shot without hurting performance. Building upon insights gained from previous work, prevailing LLM pruning strategies have consistently adhered to the practice of uniformly pruning all layers at equivalent sparsity, resulting in robust performance. However, this observation stands in contrast to the prevailing trends observed in the field of vision models, where non-uniform layerwise sparsity typically yields stronger results. To understand the underlying reasons for this disparity, we conduct a comprehensive study and discover a strong correlation with the emergence of activation outliers in LLMs, which are output features exhibiting significantly greater magnitudes compared to their counterparts. Inspired by this finding, we introduce a novel LLM pruning methodology that incorporates a tailored set of **non-uniform layerwise sparsity ratios**, termed as **Outlier Weighted Layerwise sparsity (OWL)**. The sparsity ratio of OWL is proportional to the outlier ratio observed within each layer, facilitating a more effective alignment between layerwise weight sparsity and outlier ratios. Our empirical evaluation, conducted across the LLaMA-V1 family and OPT, spanning various benchmarks, demonstrates the distinct advantages offered by OWL over previous methods. For instance, OWL exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by **61.22** and **6.80** perplexity at a high sparsity level of 70%, respectively, while delivering **2×** end-to-end inference speed-up in the DeepSparse inference engine.

Submission Number: 83

Loading