Abstract: Modern large language models (LLMs) achieve impressive accuracy but are difficult to deploy due to their enormous size and computational demands. Post-training pruning—removing redundant weights from a pre-trained model without retraining—promises to mitigate these issues but often risks channel collapse, where entire neurons are inadvertently zeroed out, especially at higher sparsity levels. We introduce a new Weighted-Iterative Pruning (WIP) approach that tackles these challenges through two key innovations. First, our weighted importance metric strikes a tunable balance between row-wise and column-wise contributions of the weight matrix, preventing over-pruning of entire channels. Second, we adopt an iterative multi-stage pruning strategy that recalculates importance scores after each partial prune, mitigating the greedy errors seen in one-shot methods. Experiments across multiple LLMs and benchmarks show that WIP preserves perplexity and zero-shot accuracy better than existing techniques, especially at high sparsities. Additionally, our 2:4 semi-structured pruned models achieve real-world inference speedups of up to 1.88\(\times \) on GPUs, underscoring WIP’s practicality for resource-constrained environments. Our code is publicly available at https://github.com/truongdo619/WIP.
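To make the two ideas described in the abstract concrete, below is a minimal sketch rather than the paper's implementation: the function names (`weighted_importance`, `iterative_prune`), the parameter `alpha`, and the magnitude-based row/column normalization are illustrative assumptions; the actual WIP metric (e.g., whether it uses activation statistics) and schedule may differ and are specified in the paper and repository.

```python
import torch

def weighted_importance(W: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Illustrative importance score that blends row-wise and column-wise
    magnitude contributions of a weight matrix W (out_features x in_features).
    alpha tunes the balance between the two views; the exact WIP metric may differ."""
    absW = W.abs()
    # Share of each weight within its row (output channel)
    row_term = absW / (absW.sum(dim=1, keepdim=True) + 1e-8)
    # Share of each weight within its column (input channel)
    col_term = absW / (absW.sum(dim=0, keepdim=True) + 1e-8)
    return alpha * row_term + (1.0 - alpha) * col_term

def iterative_prune(W: torch.Tensor, target_sparsity: float = 0.5,
                    num_stages: int = 4, alpha: float = 0.5) -> torch.Tensor:
    """Multi-stage unstructured pruning sketch: prune part of the way at each
    stage and recompute importance in between, instead of one-shot pruning."""
    W = W.clone()
    for stage in range(1, num_stages + 1):
        # Cumulative sparsity to reach by the end of this stage
        stage_sparsity = target_sparsity * stage / num_stages
        scores = weighted_importance(W, alpha)
        k = int(stage_sparsity * W.numel())
        if k == 0:
            continue
        # Zero out the k weights with the lowest importance scores;
        # already-pruned weights score zero and therefore stay pruned.
        threshold = scores.flatten().kthvalue(k).values
        W[scores <= threshold] = 0.0
    return W
```

As a usage example, `iterative_prune(layer.weight.data, target_sparsity=0.5, num_stages=4)` would reach 50% sparsity in four passes, re-scoring the remaining weights after each partial prune; a 2:4 semi-structured variant would instead keep the two highest-scoring weights in every group of four along the input dimension.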
External IDs: dblp:conf/nldb/DoSN25