DELTA4: Sparse Matrix-Vector Multiplication for Low Sparsity

Published: 03 Mar 2026, Last Modified: 01 Apr 2026 (CC BY 4.0)
Keywords: sparse neural networks, sparse inference, unstructured sparsity, spmv, gpu kernel, sparse matrix vector multiplication
TL;DR: We implement efficient sparse matrix-vector multiplication on GPU, enabling efficient inference of sparse neural networks with practical gains even at 50% unstructured sparsity.
Abstract: Sparse Large Language Models (LLMs) promise substantial efficiency gains, but realizing these benefits requires advances in both post-training methods and inference systems. Although unstructured pruning techniques can identify sparse models, existing Sparse Matrix-Vector Multiplication (SpMV) methods perform poorly under the low, unstructured sparsity ($30-90\%$) these methods produce, limiting their practical deployment. We propose **DELTA4-SpMV**, a GPU-optimized format and kernel co-designed to unlock the potential of post-training sparsification. By reducing storage overhead while remaining compatible with the GPU's execution model, DELTA4 enables efficient SpMV for unstructured sparsity without specialized hardware units or precomputation. At $50\%$ sparsity, DELTA4 is the first approach to achieve a $1.5\times$ memory reduction and a $1.2-1.5\times$ speedup over the dense baseline, as well as substantial improvements over other SpMV methods: cuSPARSE ($2.8-13.0\times$), Sputnik ($1.9-2.6\times$), and DASP ($2.2-2.5\times$). Applied to LLMs via the post-training pruning method Wanda, our approach delivers $1.5\times$ faster inference at fp16 precision and requires $1.5\times$ less memory at $50\%$ sparsity. As a result, **unstructured sparse neural networks at $50\%$ sparsity become practical** for real-world LLM workloads, and **unstructured sparsity yields practical improvements over structured 2:4 sparsity**.
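
For context on the operation being optimized: the DELTA4 format itself is not described on this page, so below is a minimal sketch of the baseline SpMV operation ($y = Ax$) using the standard CSR format, the kind of kernel that formats like DELTA4 aim to outperform at low sparsity. The kernel name `csr_spmv` and the one-thread-per-row launch scheme are illustrative assumptions, not the paper's method.

```cuda
#include <cuda_runtime.h>

// Baseline CSR SpMV sketch: y = A * x, one thread per row.
// A is stored in standard CSR as (row_ptr, col_idx, vals).
__global__ void csr_spmv(int n_rows,
                         const int* __restrict__ row_ptr,
                         const int* __restrict__ col_idx,
                         const float* __restrict__ vals,
                         const float* __restrict__ x,
                         float* __restrict__ y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float acc = 0.0f;
        // Accumulate only the stored (nonzero) entries of this row,
        // gathering x through the explicit column indices.
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j) {
            acc += vals[j] * x[col_idx[j]];
        }
        y[row] = acc;
    }
}
```

This sketch also illustrates why standard formats struggle at the sparsity levels the abstract targets: with 4-byte column indices and fp16 values, each stored nonzero costs about 6 bytes, so at $50\%$ sparsity CSR needs roughly 3 bytes per original matrix element versus 2 bytes for the dense fp16 matrix, i.e., more memory (and memory traffic) than the dense baseline, which is the storage overhead DELTA4 is designed to reduce.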
Submission Number: 35