Keywords: Foundation Models; Structured Sparsity; Sparse Neural Networks; Dynamic Sparse Training
TL;DR: Training structured sparse neural networks with learned permutations to bridge the accuracy gap with unstructured sparsity.
Abstract: Structured sparsity accelerates training and inference on modern GPUs, yet it still trails unstructured dynamic sparse training (DST) in accuracy. The shortfall stems from a loss of \emph{expressivity}: whereas a dense layer can realise every possible mask obtained by choosing any $w$ active weights out of $n$, a fixed block or $N{:}M$ layout explores only a subset of those possibilities. We propose to close this gap by learning, for each layer, a single permutation matrix jointly with the structured weight matrix. Across three canonical structures (block, $N{:}M$, and diagonal), we show that permutation-augmented DST (PA-DST) matches unstructured baselines (RigL, SET) at 90–95\% sparsity on ImageNet-1K (ViT-B/16) and WikiText-103 (GPT-2), yet trains up to $1.21\times$ and infers up to $2.9\times$ faster. The results position \emph{structure + learned permutation} as a sweet spot between accuracy and efficiency.
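To make the core idea concrete, here is a minimal PyTorch sketch of one plausible instantiation: an $N{:}M$-sparse linear layer composed with a learnable input permutation, relaxed via Sinkhorn normalization so the permutation can be trained by gradient descent. The abstract does not specify how PA-DST parameterizes or learns the permutation, so the Sinkhorn relaxation, the magnitude-based mask, and all names (`sinkhorn`, `nm_mask`, `PermutedNMLinear`) are hypothetical illustrations, not the paper's method.

```python
# Hedged sketch: structured (N:M) sparsity + a learnable per-layer permutation.
# The Sinkhorn relaxation is one common way to make a permutation trainable;
# whether PA-DST uses it is an assumption. All identifiers are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinkhorn(logits: torch.Tensor, n_iters: int = 10) -> torch.Tensor:
    """Approximately project a square logits matrix onto the set of
    doubly stochastic matrices by alternating row/column normalization."""
    log_p = logits
    for _ in range(n_iters):
        log_p = log_p - log_p.logsumexp(dim=1, keepdim=True)  # normalize rows
        log_p = log_p - log_p.logsumexp(dim=0, keepdim=True)  # normalize columns
    return log_p.exp()

def nm_mask(w: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Binary N:M mask: keep the n largest-magnitude entries in every
    group of m consecutive entries along the input dimension."""
    out_f, in_f = w.shape
    groups = w.abs().reshape(out_f, in_f // m, m)
    idx = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return mask.reshape(out_f, in_f)

class PermutedNMLinear(nn.Module):
    """Linear layer whose effective weight is (mask * W) composed with a
    learned permutation of the inputs, so the structured layout can be
    re-aligned to wherever the important weights actually live."""
    def __init__(self, in_features: int, out_features: int, n: int = 2, m: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.perm_logits = nn.Parameter(torch.zeros(in_features, in_features))
        self.n, self.m = n, m

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = sinkhorn(self.perm_logits)          # soft permutation during training
        w = self.weight * nm_mask(self.weight, self.n, self.m)
        return F.linear(x @ p, w, self.bias)    # permute inputs, then sparse matmul

x = torch.randn(8, 16)
layer = PermutedNMLinear(16, 32)
print(layer(x).shape)  # torch.Size([8, 32])
```

At inference one would round the soft matrix to a hard permutation (e.g., with a Hungarian assignment) and fold it into the weight, so the deployed layer is a plain $N{:}M$-sparse matmul and keeps the claimed speedups.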
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9607