Dynamic Sparse Training of Diagonally Sparse Networks

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A method to train diagonally sparse neural networks and accelerate them on GPUs.
Abstract: Recent advances in Dynamic Sparse Training (DST) have pushed the frontier of sparse neural network training in structured and unstructured contexts, matching dense-model performance while drastically reducing parameter counts to facilitate model scaling. However, unstructured sparsity often fails to translate into practical speedups on modern hardware. To address this shortcoming, we propose DynaDiag, a novel structured sparse-to-sparse DST method that performs on par with unstructured sparsity. DynaDiag enforces a diagonal sparsity pattern throughout training and preserves sparse computation in forward and backward passes. We further leverage the diagonal structure to accelerate computation via a custom CUDA kernel, rendering the method hardware-friendly. Empirical evaluations on diverse neural architectures demonstrate that our method maintains accuracy on par with unstructured counterparts while benefiting from tangible computational gains. Notably, with 90% sparse linear layers in ViTs, we observe up to a 3.13x speedup in online inference without sacrificing model performance and a 1.59x speedup in training on a GPU compared to equivalent unstructured layers.
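To make the diagonal-sparsity idea concrete, below is a minimal PyTorch sketch (not the authors' implementation) of a linear layer whose weight matrix is masked to a handful of wrapped diagonals. The offsets, the mask update schedule, and the `DiagSparseLinear` class itself are illustrative assumptions; DynaDiag additionally re-selects diagonals during training and uses a custom CUDA kernel rather than a dense masked matmul.

```python
# Illustrative sketch of a diagonally sparse linear layer (assumed design,
# not the paper's code): weights are restricted to a few wrapped diagonals.
import torch
import torch.nn as nn


def diagonal_mask(rows: int, cols: int, offsets: list[int]) -> torch.Tensor:
    """Build a {0,1} mask keeping only the given wrapped diagonals."""
    r = torch.arange(rows).unsqueeze(1)   # (rows, 1)
    c = torch.arange(cols).unsqueeze(0)   # (1, cols)
    mask = torch.zeros(rows, cols, dtype=torch.bool)
    for k in offsets:
        # Keep entries whose (column - row) offset matches k, wrapped mod cols.
        mask |= (c - r) % cols == k % cols
    return mask.float()


class DiagSparseLinear(nn.Module):
    """Linear layer whose weight is masked to a fixed set of diagonals.
    A DST method would periodically re-evaluate and swap the kept diagonals."""

    def __init__(self, in_features: int, out_features: int, offsets: list[int]):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.register_buffer("mask", diagonal_mask(out_features, in_features, offsets))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dense masked matmul for clarity; a dedicated kernel would exploit
        # the diagonal structure instead of multiplying by zeros.
        return nn.functional.linear(x, self.weight * self.mask, self.bias)


if __name__ == "__main__":
    layer = DiagSparseLinear(in_features=64, out_features=64, offsets=[0, 7, 21])
    out = layer(torch.randn(8, 64))
    print(out.shape, f"density={layer.mask.mean().item():.3f}")  # ~3/64 nonzero
```

Because each kept diagonal touches every row and column exactly once, the pattern stays regular at any sparsity level, which is what makes it amenable to a hardware-friendly kernel.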
Lay Summary: Modern AI models keep getting bigger, which makes them expensive to train and slow to run. Much of that cost comes from doing math on millions of weights that the model doesn’t really need. Earlier research tried “pruning” away unnecessary weights, but the usual random‑looking patterns are hard for today’s computer chips to accelerate, so the promised speed‑ups rarely appear. Our study introduces DynaDiag, a new way to keep only carefully chosen diagonal stripes of weights inside each layer. During training the algorithm constantly re‑evaluates which diagonals matter most and updates them, all while keeping the tidy diagonal layout that chips can process quickly. We also wrote custom GPU code to make sure the hardware takes full advantage. In tests on vision transformers and language models, DynaDiag kept virtually the same accuracy as the best unstructured methods but ran up to three times faster for inference and about 1.6x faster during training. Why does this matter? Faster, lighter models cut energy use and lower cloud costs. In short, DynaDiag helps make powerful AI more efficient, affordable, and widely accessible.
Primary Area: Deep Learning->Algorithms
Keywords: Structured Sparsity; Sparse Neural Networks; Dynamic Sparse Training
Submission Number: 2247