Keywords: Semi-Structured Pruning, Variational Mask Learning, Differentiable Subset Sampling, Model Compression
Abstract: Semi-structured $N$:$M$ sparsity has emerged as a practical direction for accelerating large language models (LLMs). However, existing learnable-mask approaches incur substantial parameter and memory overhead, limiting their scalability to large models and aggressive sparsity regimes. In this work, we revisit $N$:$M$ pruning from a perspective that reconciles efficiency with scalability. We propose SUSI, Semi-structured prUning via Subset samplIng, a lightweight semi-structured pruning framework that learns sparsity masks through differentiable subset sampling via weighted reservoir sampling. Unlike prior methods that model full categorical distributions over all feasible $N$:$M$ patterns, SUSI reformulates sparsity mask learning as sampling without replacement from a compact set of logits, reducing trainable parameters from combinatorial complexity to $\mathcal{O}\left(M\right)$. As a result, SUSI requires 1.5–8.75$\times$ fewer learnable parameters and significantly lower memory cost, while remaining fully aligned with hardware-friendly sparsity patterns. Extensive evaluations across multiple scales of the Qwen2.5 LLM family (0.5–7B parameters) demonstrate that SUSI achieves competitive performance with strong memory efficiency, stability across random seeds, and scalability to more aggressive $N$:$M$ sparsity patterns, offering a practical path toward efficient LLM deployment.
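The core mechanism described in the abstract — drawing $N$-of-$M$ keep masks without replacement from $\mathcal{O}(M)$ logits — can be sketched with the Gumbel top-$k$ trick, which is equivalent to weighted reservoir sampling. This is a minimal NumPy illustration under that assumption, not SUSI's actual implementation; the function names and the per-group logit layout are hypothetical, and a real training loop would use a relaxed or straight-through top-$N$ to keep gradients flowing.

```python
import numpy as np

def gumbel_topk_masks(logits, n, rng):
    """Sample one n-of-M keep mask per row of `logits`, without replacement.

    Adding independent Gumbel noise to log-weights and taking the top-n
    (the "Gumbel top-k" trick) is equivalent to weighted reservoir
    sampling: each row yields n distinct indices with probability
    proportional to exp(logits). Only M logits per group are needed,
    rather than one logit per feasible N:M pattern.
    """
    g = rng.gumbel(size=logits.shape)            # (groups, M) Gumbel noise
    order = np.argsort(-(logits + g), axis=1)    # descending perturbed logits
    masks = np.zeros_like(logits)
    np.put_along_axis(masks, order[:, :n], 1.0, axis=1)  # mark top-n as kept
    return masks

# Toy 2:4 example: a weight vector split into groups of M=4, each keeping N=2.
rng = np.random.default_rng(0)
weights = rng.normal(size=8).reshape(-1, 4)      # 2 groups of M=4 weights
logits = rng.normal(size=weights.shape)          # O(M) learnable logits per group
masks = gumbel_topk_masks(logits, n=2, rng=rng)
pruned = weights * masks
assert (masks.sum(axis=1) == 2).all()            # exactly N=2 survivors per group
```

Because the hard top-$n$ selection always keeps exactly $N$ entries per group of $M$, the resulting mask is hardware-friendly by construction, matching the $N$:$M$ pattern that sparse tensor cores accelerate.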
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: pruning, LLM Efficiency, parameter-efficient-training
Contribution Types: Approaches for low-compute settings-efficiency, Theory
Languages Studied: English
Submission Number: 5190