Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models

Yingtao Zhang; Haoli Bai; Haokun Lin; Jialin Zhao; Lu Hou; Carlo Vittorio Cannistraci

Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models

Yingtao Zhang, Haoli Bai, Haokun Lin, Jialin Zhao, Lu Hou, Carlo Vittorio Cannistraci

Published: 16 Jan 2024, Last Modified: 21 Apr 2024ICLR 2024 posterEveryoneRevisionsBibTeX

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: Post-Training Pruning, Combinatorial Optimization, Large Language Models, Inference Acceleration

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: By integrating Relative Importance and Activations and Channel Permutation, we present a plug-and-play solution for post-training pruning of LLMs, which accelerates the inference speed of LLMs without performance degradation.

Abstract: With the rapid growth of large language models (LLMs), there is increasing demand for memory and computation in LLMs. Recent efforts on post-training pruning of LLMs aim to reduce the model size and computation requirements, yet the performance is still sub-optimal. In this paper, we present a plug-and-play solution for post-training pruning of LLMs. The proposed solution has two innovative components: 1) **Relative Importance and Activations (RIA)**, a new pruning metric that jointly considers the weight and activations efficiently on LLMs, and 2) **Channel Permutation**, a new approach to maximally preserves important weights under N:M sparsity. The two proposed components can be readily combined to further enhance the N:M semi-structured pruning of LLMs. Our empirical experiments show that RIA alone can already surpass all existing post-training pruning methods on prevalent LLMs, e.g., LLaMA ranging from 7B to 65B. Furthermore, N:M semi-structured pruning with channel permutation can even outperform the original LLaMA2-70B on zero-shot tasks, together with practical speed-up on specific hardware. Our code is available at: https://github.com/biomedical-cybernetics/Relative-importance-and-activation-pruning

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Primary Area: general machine learning (i.e., none of the above)

Submission Number: 1198

Loading