Abstract: Modern large language models (LLMs) achieve impressive accuracy but are difficult to deploy due to their enormous size and computational demands. Post-training pruning—removing redundant weights from a pre-trained model without retraining—promises to mitigate these issues but often risks channel collapse, where entire neurons are inadvertently zeroed out, especially at higher sparsity levels. We introduce a new Weighted-Iterative Pruning (WIP) approach that tackles these challenges through two key innovations. First, our weighted importance metric strikes a tunable balance between row-wise and column-wise contributions of the weight matrix, preventing over-pruning of entire channels. Second, we adopt an iterative multi-stage pruning strategy that recalculates importance scores after each partial prune, mitigating the greedy errors seen in one-shot methods. Experiments across multiple LLMs and benchmarks show that WIP preserves perplexity and zero-shot accuracy better than existing techniques, especially at high sparsities. Additionally, our 2:4 semi-structured pruned models achieve real-world inference speedups of up to 1.88\(\times \) on GPUs, underscoring WIP’s practicality for resource-constrained environments. Our code is publicly available at https://github.com/truongdo619/WIP.
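To make the two ideas described in the abstract concrete, below is a minimal sketch rather than the paper's implementation: the function names (`weighted_importance`, `iterative_prune`), the parameter `alpha`, and the magnitude-based row/column normalization are illustrative assumptions; the actual WIP metric (e.g., whether it uses activation statistics) and schedule may differ and are specified in the paper and repository.

```python
import torch

def weighted_importance(W: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Illustrative importance score that blends row-wise and column-wise
    magnitude contributions of a weight matrix W (out_features x in_features).
    alpha tunes the balance between the two views; the exact WIP metric may differ."""
    absW = W.abs()
    # Share of each weight within its row (output channel)
    row_term = absW / (absW.sum(dim=1, keepdim=True) + 1e-8)
    # Share of each weight within its column (input channel)
    col_term = absW / (absW.sum(dim=0, keepdim=True) + 1e-8)
    return alpha * row_term + (1.0 - alpha) * col_term

def iterative_prune(W: torch.Tensor, target_sparsity: float = 0.5,
                    num_stages: int = 4, alpha: float = 0.5) -> torch.Tensor:
    """Multi-stage unstructured pruning sketch: prune part of the way at each
    stage and recompute importance in between, instead of one-shot pruning."""
    W = W.clone()
    for stage in range(1, num_stages + 1):
        # Cumulative sparsity to reach by the end of this stage
        stage_sparsity = target_sparsity * stage / num_stages
        scores = weighted_importance(W, alpha)
        k = int(stage_sparsity * W.numel())
        if k == 0:
            continue
        # Zero out the k weights with the lowest importance scores;
        # already-pruned weights score zero and therefore stay pruned.
        threshold = scores.flatten().kthvalue(k).values
        W[scores <= threshold] = 0.0
    return W
```

As a usage example, `iterative_prune(layer.weight.data, target_sparsity=0.5, num_stages=4)` would reach 50% sparsity in four passes, re-scoring the remaining weights after each partial prune; a 2:4 semi-structured variant would instead keep the two highest-scoring weights in every group of four along the input dimension.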
External IDs: dblp:conf/nldb/DoSN25