Keywords: LLM Compression, Semi-Structured Pruning, Differentiable Subset Sampling, Learnable Sparsity
Abstract: The rapid growth of large language models (LLMs) has driven the need for efficient post-training optimization techniques that reduce computational and memory demands while preserving performance. Semi-structured pruning, which enforces hardware-compatible sparsity patterns such as N:M sparsity, offers a balanced approach for accelerating inference. In this study, we introduce SUSI (Semi-structured prUning via Subset samplIng), a novel semi-structured pruning method that leverages weighted reservoir sampling and differentiable subset sampling to learn high-quality N:M sparsity masks with minimal computational cost. Compared to other learnable mask methods (e.g., MaskLLM), which increase parameter complexity, SUSI reduces trainable parameters by up to 1.5× for 2:4 sparsity, enabling efficient deployment on hardware optimized for sparse computation. We evaluate SUSI on three OPT model variants (125M, 350M, and 1.3B parameters) using benchmarks including Wikitext-2 for perplexity and zero-shot NLP tasks (e.g., ARC, HellaSwag, PIQA, RACE, SciQ). SUSI consistently surpasses baselines such as SparseGPT, Wanda, and MaskLLM in perplexity while maintaining competitive zero-shot accuracy across these benchmarks. These results establish SUSI as a robust and practical solution for compressing LLMs, facilitating efficient deployment in resource-constrained environments.
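To make the core idea concrete, the following is a minimal PyTorch sketch of learning an N:M mask with the Gumbel-top-k trick, the differentiable counterpart of weighted reservoir sampling. It is not the paper's implementation: the class name GroupedNMMask, the straight-through estimator, and the temperature parameter tau are illustrative assumptions.

```python
# Minimal sketch: relaxed N:M mask learning via Gumbel-top-k (assumed details).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedNMMask(nn.Module):
    """Learns per-weight logits and samples a relaxed N:M mask per group of M."""

    def __init__(self, weight_shape, n=2, m=4):
        super().__init__()
        self.n, self.m = n, m
        numel = int(torch.tensor(weight_shape).prod())
        assert numel % m == 0, "weight count must be divisible by M"
        # One trainable logit per weight; each group of M competes for N slots.
        self.logits = nn.Parameter(torch.zeros(numel // m, m))
        self.weight_shape = weight_shape

    def forward(self, tau=1.0, hard=True):
        # Gumbel-top-k: perturb logits with Gumbel noise, then keep the top N
        # per group -- a reparameterized weighted-reservoir sample.
        gumbel = -torch.log(-torch.log(torch.rand_like(self.logits) + 1e-9) + 1e-9)
        keys = (self.logits + gumbel) / tau
        soft = (F.softmax(keys, dim=-1) * self.n).clamp(max=1.0)  # relaxed N-hot
        if hard:
            # Straight-through: exact N:M pattern forward, soft gradient backward.
            idx = keys.topk(self.n, dim=-1).indices
            hard_mask = torch.zeros_like(soft).scatter_(-1, idx, 1.0)
            mask = hard_mask + soft - soft.detach()
        else:
            mask = soft
        return mask.reshape(self.weight_shape)


if __name__ == "__main__":
    w = torch.randn(8, 16)
    masker = GroupedNMMask(w.shape, n=2, m=4)
    mask = masker(tau=0.5)
    print((mask.reshape(-1, 4) > 0.5).sum(dim=-1))  # each group keeps exactly 2
    loss = ((w * mask) ** 2).sum()
    loss.backward()  # gradients flow to masker.logits
```

Because the forward pass produces an exact 2:4 pattern while gradients flow through the relaxed scores, the mask logits can be trained end-to-end against any task or distillation loss under this assumed setup.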
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Submission Number: 10861