Keywords: LLM pruning, semi-structured sparsity, hypernetwork, continual learning
TL;DR: We propose HyperPrune, a resource-efficient framework that uses a lightweight hypernetwork to generate semi-structured $n{:}m$ sparse masks for large language models, achieving a strong accuracy–efficiency trade-off.
Abstract: Large Language Models (LLMs) achieve state-of-the-art performance but are costly to deploy in resource-constrained environments. Pruning with $n:m$ semi-structured sparsity reduces computation and enables hardware acceleration, yet existing methods face a trade-off: one-shot approaches are efficient but heuristic, while optimization-based methods are accurate but expensive.
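For context, the sketch below illustrates the generic $n{:}m$ pattern itself (here 2:4, keeping the two largest-magnitude weights in every group of four), in the style of one-shot magnitude heuristics; it is not HyperPrune's selection procedure, and all function names are hypothetical.
```python
import torch

def nm_magnitude_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Return a mask keeping the n largest-|w| entries in each group of m columns."""
    rows, cols = weight.shape
    assert cols % m == 0, "number of columns must be divisible by m"
    groups = weight.abs().reshape(rows, cols // m, m)   # (rows, groups, m)
    keep = groups.topk(n, dim=-1).indices               # indices of weights to keep
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, keep, 1.0)                        # 1 for kept weights, 0 otherwise
    return mask.reshape(rows, cols)

W = torch.randn(8, 16)
mask = nm_magnitude_mask(W, n=2, m=4)
pruned = W * mask                    # exactly 2 of every 4 consecutive weights survive
print(mask.mean())                   # -> 0.5 density for a 2:4 pattern
```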
We introduce \textbf{HyperPrune}, a resource-efficient framework that directly optimizes $n{:}m$ sparsity. A lightweight hypernetwork, shared across layers and conditioned on learnable layer embeddings, generates semi-structured masks in a one-shot, layer-wise manner. \textit{Continual pruning} preserves cross-layer knowledge, and \textit{feature outlier regularization} retains critical activations; together, these components unify the strengths of heuristic and optimization-based methods.
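A minimal sketch of the mask-generation idea described above, assuming a learnable per-layer embedding fed to a small shared MLP whose group-wise scores are discretized with a top-$n$ rule; class and parameter names are hypothetical, and the training details (score relaxation, continual pruning, and outlier regularization) are omitted.
```python
import torch
import torch.nn as nn

class MaskHypernetwork(nn.Module):
    """Shared MLP that maps a per-layer embedding to an n:m mask (illustrative only)."""

    def __init__(self, num_layers: int, weights_per_layer: int,
                 emb_dim: int = 64, n: int = 2, m: int = 4):
        super().__init__()
        assert weights_per_layer % m == 0
        self.n, self.m = n, m
        self.layer_emb = nn.Embedding(num_layers, emb_dim)   # one embedding per layer
        self.mlp = nn.Sequential(                            # shared across all layers
            nn.Linear(emb_dim, 256), nn.ReLU(),
            nn.Linear(256, weights_per_layer),
        )

    def forward(self, layer_idx: int) -> torch.Tensor:
        logits = self.mlp(self.layer_emb.weight[layer_idx])  # one score per weight
        groups = logits.view(-1, self.m)                     # (groups, m)
        keep = groups.topk(self.n, dim=-1).indices           # top-n scores per group
        mask = torch.zeros_like(groups)
        mask.scatter_(-1, keep, 1.0)
        return mask.view(-1)                                 # flat 0/1 mask for the layer

hyper = MaskHypernetwork(num_layers=32, weights_per_layer=4096 * 4)
mask = hyper(layer_idx=0)
print(mask.mean())   # -> 0.5 density, i.e. a 2:4 pattern
```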
Experiments on LLaMA models from 7B to 70B parameters, run on a single A100 GPU, show state-of-the-art accuracy–sparsity trade-offs, with better efficiency and scalability than prior approaches. HyperPrune offers a practical, scalable, and hardware-friendly solution for semi-structured LLM pruning.
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 18483