Enhancing One-Shot Pruned Generative Pre-trained Language Models through a Sparse-Dense-Sparse Mechanism

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Generative pre-trained language models, Unstructured pruning, Sparse regularization, Model compression
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Generative pre-trained language models (PLMs) are engineered for robust contextual understanding and exhibit outstanding performance on various natural language processing tasks. However, their considerable size incurs significant computational and storage costs. Modern pruning strategies employ one-shot techniques to compress PLMs without retraining on task-specific or general data; however, these approaches often cause a non-negligible drop in performance. In this paper, we propose SDS, a Sparse-Dense-Sparse pruning framework that enhances the performance of pruned PLMs from a weight-distribution optimization perspective. The pruning process proceeds in three steps. First, we prune less critical connections in the model with a conventional one-shot pruning method. Second, we reconstruct a dense model with a pruning-friendly weight distribution by reactivating the pruned connections under sparse regularization. Finally, we perform a second pruning round, yielding a pruned model superior to that of the initial round. Notably, SDS requires only a limited number of calibration samples, comparable to typical one-shot pruning methods, yet significantly outperforms them. Experimental results demonstrate that, under an identical sparsity configuration, SDS outperforms the state-of-the-art pruning technique SparseGPT, reducing language comprehension perplexity by an average of 2.4 and improving accuracy by over 2% on average across seven downstream tasks on OPT models.
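The three-step pipeline described in the abstract can be sketched layer-wise. Below is a minimal, hypothetical PyTorch illustration: magnitude pruning stands in for the one-shot pruner (the submission targets methods such as SparseGPT), and the dense-reconstruction step is approximated by regressing onto the original dense layer's outputs on a small calibration batch with an L1 penalty that pushes the weights toward a pruning-friendly distribution. All function names, the reconstruction objective, and the hyperparameters are illustrative assumptions, not the submission's actual implementation.

```python
import copy
import torch
import torch.nn as nn

def magnitude_mask(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask keeping the largest-magnitude entries of w."""
    k = max(int(w.numel() * sparsity), 1)          # number of weights to drop
    threshold = w.abs().flatten().kthvalue(k).values
    return (w.abs() > threshold).float()

@torch.no_grad()
def one_shot_prune(layer: nn.Linear, sparsity: float) -> None:
    """Stand-in for the paper's one-shot pruner (e.g., SparseGPT)."""
    layer.weight.mul_(magnitude_mask(layer.weight, sparsity))

def sds_layer(dense_layer: nn.Linear, calib_x: torch.Tensor,
              sparsity: float = 0.5, l1_lambda: float = 1e-4,
              lr: float = 1e-4, steps: int = 200) -> nn.Linear:
    # Reconstruction target: the original dense layer's outputs
    # on the calibration samples (an assumed layer-wise objective).
    target = dense_layer(calib_x).detach()

    layer = copy.deepcopy(dense_layer)
    one_shot_prune(layer, sparsity)                # Step 1: initial pruning

    # Step 2: dense reconstruction. All connections are reactivated
    # (trainable again); the L1 term acts as the sparse regularizer,
    # concentrating weight mass so the next pruning round loses less.
    opt = torch.optim.Adam(layer.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(layer(calib_x), target)
        loss = loss + l1_lambda * layer.weight.abs().sum()
        loss.backward()
        opt.step()

    one_shot_prune(layer, sparsity)                # Step 3: second pruning
    return layer

if __name__ == "__main__":
    layer = nn.Linear(512, 512)
    calib = torch.randn(128, 512)                  # small calibration batch
    pruned = sds_layer(layer, calib)
    print("zeros:", (pruned.weight == 0).float().mean().item())
```

In this sketch the final sparsity pattern matches the initial one in rate but not necessarily in support: the regularized dense phase lets previously pruned connections re-enter if they become important, which is the intuition behind the sparse-dense-sparse ordering.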
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5070