In this paper, we demonstrate how to apply 2:4 sparsity, a hardware-accelerated GPU sparsity pattern, to activations to accelerate large language model training and inference. Crucially, we exploit the intrinsic sparsity of Squared-ReLU activations to provide this acceleration with no accuracy loss. Our approach achieves up to 1.3x faster Feed-Forward Networks (FFNs) in both the forward and backward passes. We also discuss the benefits of combining 2:4 sparsity with fp8 quantization to maximize efficiency gains. This work highlights the potential for sparsity to play a key role in accelerating large language model training and inference.
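To illustrate the core idea, the sketch below shows how a Squared-ReLU activation yields many exact zeros and how those activations can be pruned to the 2:4 pattern (keep the two largest-magnitude values in every contiguous group of four). This is a minimal, hypothetical example for intuition only, not the authors' implementation or a hardware-accelerated kernel; the function names `squared_relu` and `prune_2_to_4` are assumptions.

```python
import torch

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    """Squared-ReLU activation: relu(x)**2, exactly zero wherever x <= 0."""
    return torch.relu(x) ** 2

def prune_2_to_4(x: torch.Tensor) -> torch.Tensor:
    """Naive 2:4 pruning: zero the two smallest-magnitude entries in each
    contiguous group of four (real sparse tensor cores do this in hardware)."""
    groups = x.reshape(-1, 4)
    # indices of the two smallest-magnitude elements per group
    drop = groups.abs().topk(2, dim=-1, largest=False).indices
    return groups.scatter(-1, drop, 0.0).reshape(x.shape)

# Activations after Squared-ReLU are already highly sparse, so enforcing the
# 2:4 pattern discards little information.
h = torch.randn(8, 16)      # pre-activation values
a = squared_relu(h)         # roughly 50% exact zeros in expectation
a_24 = prune_2_to_4(a)      # conforms to the 2:4 sparsity pattern
print((a == 0).float().mean().item(), (a_24 == 0).float().mean().item())
```

Because Squared-ReLU already zeros out around half of the entries, the 2:4 constraint is often satisfied (or nearly satisfied) for free, which is what allows the hardware speedup without accuracy loss.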