An Efficient Pruner for Large Language Model with Theoretical Guarantee

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC 4.0
Abstract: Large Language Models (LLMs) have showcased remarkable performance across a range of tasks but are hindered by their massive parameter sizes, which impose significant computational and storage demands. Pruning has emerged as an effective solution to reduce model size, but traditional methods often involve inefficient retraining or rely on heuristic-based one-shot approaches that lack theoretical guarantees. In this paper, we reformulate the pruning problem as an $\ell_0$-penalized optimization problem and propose a monotone accelerated Iterative Hard Thresholding (mAIHT) method. Our approach combines solid theoretical foundations with practical effectiveness, offering a detailed theoretical analysis that covers convergence, convergence rates, and risk upper bounds. Through extensive experiments, we demonstrate that mAIHT outperforms state-of-the-art pruning techniques by effectively pruning the LLaMA-7B model across various evaluation metrics.
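The abstract casts pruning as an $\ell_0$-penalized problem solved by a monotone accelerated IHT method. As a rough illustration of the underlying idea, here is a minimal sketch of plain (non-accelerated) iterative hard thresholding on a least-squares objective; this is the classical building block, not the paper's mAIHT algorithm, and the function names and the step-size choice `1/L` are illustrative assumptions:

```python
import numpy as np

def hard_threshold(x, k):
    """Projection onto the l0 ball: keep the k largest-magnitude
    entries of x and zero out the rest."""
    out = np.zeros_like(x)
    if k > 0:
        idx = np.argsort(np.abs(x))[-k:]
        out[idx] = x[idx]
    return out

def iht(A, b, k, step=None, iters=500):
    """Basic IHT for min ||Ax - b||^2 subject to ||x||_0 <= k.
    Iterates x <- H_k(x - step * grad), a gradient step followed
    by hard thresholding. (Illustrative sketch, not the paper's
    monotone accelerated variant.)"""
    if step is None:
        # 1/L with L the Lipschitz constant of the gradient,
        # i.e. the squared spectral norm of A (an assumed default).
        step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = hard_threshold(x - step * grad, k)
    return x
```

The paper's mAIHT adds a momentum (acceleration) step and a monotonicity safeguard on top of this basic iteration; the sketch above only shows the hard-thresholding core that makes the iterates exactly sparse at every step.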
Lay Summary: Large Language Models (LLMs), like ChatGPT, have shown incredible capabilities in language understanding and generation. However, they come with a major drawback: their enormous size, which makes them slow, expensive, and difficult to use on many devices. To address this, researchers often use pruning — removing parts of the model that seem less important — to reduce size while maintaining performance. But common pruning methods can be either inefficient or based on heuristic strategies with little mathematical justification. In our work, we introduce a new pruning method with strong theoretical backing. We treat pruning as a mathematical problem that balances performance and simplicity, and solve it using a technique called monotone accelerated Iterative Hard Thresholding (mAIHT). Unlike many existing methods, ours comes with rigorous proofs showing that it works reliably and efficiently. We also test it extensively on popular open-source LLMs, showing that our approach removes unnecessary parts better than leading pruning methods, all while preserving the model's abilities. This research helps make LLMs faster, cheaper, and more accessible without sacrificing much intelligence.
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Model, Pruning, Risk Upper Bound, Monotone accelerated Iterative Hard Thresholding
Submission Number: 9789