Keywords: Foundation Models; Structured Sparsity; Sparse Neural Networks; Dynamic Sparse Training
TL;DR: Training structured sparse neural networks with learned permutations to bridge the accuracy gap with unstructured sparsity.
Abstract: Structured sparsity accelerates training and inference on modern GPUs, yet it still trails unstructured dynamic sparse training (DST) in accuracy. The shortfall stems from a loss of \emph{expressivity}: whereas a dense layer can realise every possible mask obtained by choosing any $w$ active weights out of $n$, a fixed block or $N{:}M$ layout explores only a subset of those possibilities. We propose to close this gap by learning, for each layer, a single permutation matrix jointly with the structured weight matrix. Across three canonical structures (block, $N{:}M$, and diagonal), we show that permutation-augmented DST (PA-DST) matches unstructured baselines (RigL, SET) at 90–95\% sparsity on ImageNet-1K (ViT-B/16) and WikiText-103 (GPT-2), yet trains up to $1.21\times$ and infers up to $2.9\times$ faster. The results position \emph{structure + learned permutation} as a sweet spot between accuracy and efficiency.
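To make the core idea concrete, here is a minimal PyTorch sketch of one plausible instantiation: an $N{:}M$-sparse linear layer composed with a learnable input permutation, relaxed via Sinkhorn normalization so the permutation can be trained by gradient descent. The abstract does not specify how PA-DST parameterizes or learns the permutation, so the Sinkhorn relaxation, the magnitude-based mask, and all names (`sinkhorn`, `nm_mask`, `PermutedNMLinear`) are hypothetical illustrations, not the paper's method.

```python
# Hedged sketch: structured (N:M) sparsity + a learnable per-layer permutation.
# The Sinkhorn relaxation is one common way to make a permutation trainable;
# whether PA-DST uses it is an assumption. All identifiers are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinkhorn(logits: torch.Tensor, n_iters: int = 10) -> torch.Tensor:
    """Approximately project a square logits matrix onto the set of
    doubly stochastic matrices by alternating row/column normalization."""
    log_p = logits
    for _ in range(n_iters):
        log_p = log_p - log_p.logsumexp(dim=1, keepdim=True)  # normalize rows
        log_p = log_p - log_p.logsumexp(dim=0, keepdim=True)  # normalize columns
    return log_p.exp()

def nm_mask(w: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Binary N:M mask: keep the n largest-magnitude entries in every
    group of m consecutive entries along the input dimension."""
    out_f, in_f = w.shape
    groups = w.abs().reshape(out_f, in_f // m, m)
    idx = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return mask.reshape(out_f, in_f)

class PermutedNMLinear(nn.Module):
    """Linear layer whose effective weight is (mask * W) composed with a
    learned permutation of the inputs, so the structured layout can be
    re-aligned to wherever the important weights actually live."""
    def __init__(self, in_features: int, out_features: int, n: int = 2, m: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.perm_logits = nn.Parameter(torch.zeros(in_features, in_features))
        self.n, self.m = n, m

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = sinkhorn(self.perm_logits)          # soft permutation during training
        w = self.weight * nm_mask(self.weight, self.n, self.m)
        return F.linear(x @ p, w, self.bias)    # permute inputs, then sparse matmul

x = torch.randn(8, 16)
layer = PermutedNMLinear(16, 32)
print(layer(x).shape)  # torch.Size([8, 32])
```

At inference one would round the soft matrix to a hard permutation (e.g., with a Hungarian assignment) and fold it into the weight, so the deployed layer is a plain $N{:}M$-sparse matmul and keeps the claimed speedups.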
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9607