Keywords: LLM pruning, semi-structured sparsity, hypernetwork, continual learning
TL;DR: We propose HyperPrune, a resource-efficient framework that uses a lightweight hypernetwork to generate semi-structured $n{:}m$ sparse masks for large language models, achieving a strong accuracy–efficiency trade-off.
Abstract: Large Language Models (LLMs) achieve state-of-the-art performance but are costly to deploy in resource-constrained environments. Pruning with $n:m$ semi-structured sparsity reduces computation and enables hardware acceleration, yet existing methods face a trade-off: one-shot approaches are efficient but heuristic, while optimization-based methods are accurate but expensive.
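For context, the sketch below illustrates the generic $n{:}m$ pattern itself (here 2:4, keeping the two largest-magnitude weights in every group of four), in the style of one-shot magnitude heuristics; it is not HyperPrune's selection procedure, and all function names are hypothetical.
```python
import torch

def nm_magnitude_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Return a mask keeping the n largest-|w| entries in each group of m columns."""
    rows, cols = weight.shape
    assert cols % m == 0, "number of columns must be divisible by m"
    groups = weight.abs().reshape(rows, cols // m, m)   # (rows, groups, m)
    keep = groups.topk(n, dim=-1).indices               # indices of weights to keep
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, keep, 1.0)                        # 1 for kept weights, 0 otherwise
    return mask.reshape(rows, cols)

W = torch.randn(8, 16)
mask = nm_magnitude_mask(W, n=2, m=4)
pruned = W * mask                    # exactly 2 of every 4 consecutive weights survive
print(mask.mean())                   # -> 0.5 density for a 2:4 pattern
```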
We introduce \textbf{HyperPrune}, a resource-efficient framework that directly optimizes $n{:}m$ sparsity. A lightweight hypernetwork, shared across layers and conditioned on learnable layer embeddings, generates semi-structured masks in a one-shot, layer-wise manner. \textit{Continual pruning} preserves cross-layer knowledge, and \textit{feature outlier regularization} retains critical activations; together, these components unify the strengths of heuristic and optimization-based methods.
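A minimal sketch of the mask-generation idea described above, assuming a learnable per-layer embedding fed to a small shared MLP whose group-wise scores are discretized with a top-$n$ rule; class and parameter names are hypothetical, and the training details (score relaxation, continual pruning, and outlier regularization) are omitted.
```python
import torch
import torch.nn as nn

class MaskHypernetwork(nn.Module):
    """Shared MLP that maps a per-layer embedding to an n:m mask (illustrative only)."""

    def __init__(self, num_layers: int, weights_per_layer: int,
                 emb_dim: int = 64, n: int = 2, m: int = 4):
        super().__init__()
        assert weights_per_layer % m == 0
        self.n, self.m = n, m
        self.layer_emb = nn.Embedding(num_layers, emb_dim)   # one embedding per layer
        self.mlp = nn.Sequential(                            # shared across all layers
            nn.Linear(emb_dim, 256), nn.ReLU(),
            nn.Linear(256, weights_per_layer),
        )

    def forward(self, layer_idx: int) -> torch.Tensor:
        logits = self.mlp(self.layer_emb.weight[layer_idx])  # one score per weight
        groups = logits.view(-1, self.m)                     # (groups, m)
        keep = groups.topk(self.n, dim=-1).indices           # top-n scores per group
        mask = torch.zeros_like(groups)
        mask.scatter_(-1, keep, 1.0)
        return mask.view(-1)                                 # flat 0/1 mask for the layer

hyper = MaskHypernetwork(num_layers=32, weights_per_layer=4096 * 4)
mask = hyper(layer_idx=0)
print(mask.mean())   # -> 0.5 density, i.e. a 2:4 pattern
```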
Experiments on LLaMA models from 7B to 70B parameters, run on a single A100 GPU, show state-of-the-art accuracy–sparsity trade-offs, with better efficiency and scalability than prior approaches. HyperPrune offers a practical, scalable, and hardware-friendly solution for semi-structured LLM pruning.
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 18483