Keywords: Large Language Models; Structured Pruning; Evolutionary Algorithm
TL;DR: We propose an efficient automated pruning framework that enables LLMs to perform evolutionary search for self-pruning, achieving a new state-of-the-art in post-training structured pruning for LLMs.
Abstract: Despite exceptional capabilities, Large Language Models (LLMs) still face deployment challenges due to their enormous size.
Post-training structured pruning is a promising solution: it prunes LLMs without retraining, reducing computational overhead, and the pruned models remain hardware-deployment friendly.
However, the training-free nature of post-training structured pruning leads to significant performance degradation.
We argue that the key to mitigating this issue lies in accurately determining the pruning rate for each layer.
Meanwhile, we find that LLMs may have prior knowledge about their own redundancy.
Based on this insight, we introduce $\textbf{Self-Pruner}$, an end-to-end automatic self-pruning framework for LLMs that efficiently searches for layer-wise pruning rates.
Specifically, $\textbf{Self-Pruner}$ leverages LLMs to autonomously execute the entire evolutionary search over pruning rate configurations.
In this process, LLMs are used to generate populations, select parent solutions from the current population, and perform crossover and mutation operations to produce offspring solutions.
In this way, LLMs automatically generate and evaluate a large number of candidate solutions, converging to effective pruning rate configurations with minimal human intervention.
Extensive experiments demonstrate $\textbf{Self-Pruner}$'s better performance compared to existing state-of-the-art methods.
Notably, $\textbf{Self-Pruner}$ prunes LLaMA-2-70B to the 49B level with only a 0.80% drop in accuracy across seven commonsense reasoning tasks, achieving a 1.39$\times$ speedup on an NVIDIA A100 80GB GPU. Further pruning to the 35B level results in only a 3.80% decrease in accuracy while obtaining a 1.70$\times$ speedup. Code is available in the supplementary material; a simplified sketch of the search loop is shown below.
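For illustration only, here is a minimal Python sketch of the LLM-driven evolutionary search loop described in the abstract. The helpers `query_llm`, `fitness`, `ask_for_population`, and `ask_for_offspring`, along with constants such as `NUM_LAYERS` and `TARGET_SPARSITY`, are hypothetical placeholders and not the authors' implementation; the actual prompts, fitness metric, and pruning backend are in the supplementary code.

```python
# Minimal sketch (assumptions, not the authors' code): an LLM proposes, selects,
# and recombines layer-wise pruning-rate configurations; fitness is measured by
# pruning the target model and scoring it on a calibration set.
import json

NUM_LAYERS = 32          # hypothetical: e.g., a 32-layer LLM
TARGET_SPARSITY = 0.5    # hypothetical: desired average pruning rate
POP_SIZE, GENERATIONS = 20, 10

def query_llm(prompt: str) -> str:
    """Placeholder: send a prompt to an LLM and return its text reply."""
    raise NotImplementedError

def fitness(rates: list[float]) -> float:
    """Placeholder: prune the target model with the given layer-wise rates
    and return a score (e.g., negative calibration perplexity)."""
    raise NotImplementedError

def ask_for_population(n: int) -> list[list[float]]:
    """Ask the LLM to generate an initial population of configurations."""
    prompt = (
        f"Propose {n} pruning-rate configurations for a {NUM_LAYERS}-layer LLM. "
        f"Each configuration is a JSON list of {NUM_LAYERS} floats in [0, 1] "
        f"whose mean is about {TARGET_SPARSITY}. Reply with a JSON list of lists."
    )
    return json.loads(query_llm(prompt))

def ask_for_offspring(parents: list[list[float]], n: int) -> list[list[float]]:
    """Ask the LLM to select parents and apply crossover/mutation."""
    prompt = (
        "Here are parent configurations, best first:\n"
        f"{json.dumps(parents)}\n"
        f"Select promising parents, then apply crossover and mutation to produce "
        f"{n} new configurations with mean sparsity about {TARGET_SPARSITY}. "
        "Reply with a JSON list of lists."
    )
    return json.loads(query_llm(prompt))

def evolve() -> list[float]:
    """Run the evolutionary search and return the best configuration found."""
    population = ask_for_population(POP_SIZE)
    for _ in range(GENERATIONS):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: POP_SIZE // 2]
        offspring = ask_for_offspring(survivors, POP_SIZE - len(survivors))
        population = survivors + offspring
    return max(population, key=fitness)

if __name__ == "__main__":
    print("Best layer-wise pruning rates:", evolve())
```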
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 414