Let LLM Tell What to Prune and How Much to Prune

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
TL;DR: We propose a pruning method that targets multiple LLM modules with dynamic pruning ratios by quantifying the complex interactions within LLMs, achieving better trade-off between efficiency and performance.
Abstract: Large language models (LLMs) have revolutionized various AI applications. However, their billions of parameters pose significant challenges for practical deployment. Structured pruning is a hardware-friendly compression technique that has received widespread attention. Nonetheless, existing literature typically targets a single structural unit of LLMs. We observe that the structural units of LLMs differ in terms of inference cost and functionality. Therefore, pruning a single structural unit in isolation often results in an imbalance between performance and efficiency. In addition, previous works mainly employ a prescribed pruning ratio. Since the significance of LLM modules may vary, it is ideal to distribute the pruning load to each structural unit according to its role within the LLM. To address these two issues, we propose a pruning method that targets multiple LLM modules with dynamic pruning ratios. Specifically, we find that the intrinsic properties of LLMs can guide us in determining the importance of each module and thus distributing the pruning load on demand, i.e., what to prune and how much to prune. This is achieved by quantifying the complex interactions within LLMs. Extensive experiments on multiple benchmarks and LLM variants demonstrate that our method effectively balances the trade-off between efficiency and performance.
Lay Summary: Large language models (LLMs) have revolutionized various AI applications, but their billions of parameters pose significant challenges for practical deployment. A common solution is to prune unimportant parts of the LLM to reduce its size and improve efficiency. However, most existing pruning methods focus on a single structural unit of the LLM, which can upset the balance between performance and efficiency. We observe that different modules of an LLM serve different roles. In particular, we find that the intrinsic properties of LLMs can help us determine the importance of each module and accordingly allocate pruning ratios more effectively. This is achieved by quantifying the complex interactions within LLMs. Therefore, in this work, we propose a pruning method that targets multiple LLM modules, with pruning ratios dynamically assigned according to the relative importance of each module. Extensive experiments demonstrate that our method achieves a better trade-off between efficiency and performance.
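To make the idea of importance-aware allocation concrete, below is a minimal, illustrative Python sketch. It is not the paper's method: the function `allocate_pruning_ratios`, its inverse-importance heuristic, and the placeholder module scores are all assumptions for illustration; the paper instead derives module importance by quantifying the interactions within the LLM.

```python
def allocate_pruning_ratios(importance, global_ratio=0.3):
    """Toy allocation: modules with lower importance receive larger pruning ratios.

    `importance` maps module names to importance scores. The scores here are
    placeholders; the paper obtains them by quantifying interactions within
    the LLM, which this sketch does not reproduce.
    """
    # Weight each module by the inverse of its importance.
    inverse = {name: 1.0 / (score + 1e-8) for name, score in importance.items()}
    total = sum(inverse.values())
    # Scale so the average ratio across modules matches the global budget.
    return {name: global_ratio * len(inverse) * inv / total
            for name, inv in inverse.items()}


# Example with made-up scores: attention is deemed more important than the MLP,
# so the MLP absorbs a larger share of the pruning load.
ratios = allocate_pruning_ratios({"attention": 0.8, "mlp": 0.4})
print(ratios)  # roughly {'attention': 0.2, 'mlp': 0.4}
```

The sketch only conveys the allocation principle (distributing a fixed pruning budget unevenly across modules); the choice of scoring function and the set of structural units to prune are where the paper's actual contribution lies.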
Primary Area: Deep Learning->Large Language Models
Keywords: large language models, model compression, network pruning, structured pruning
Submission Number: 16094