Abstract: In this paper, we demonstrate how to apply 2:4 sparsity, a hardware-accelerated GPU sparsity pattern, to activations to accelerate large language model training and inference. Crucially, we exploit the intrinsic sparsity found in Squared-ReLU activations to provide this acceleration with no accuracy loss. Our approach achieves up to 1.3x faster Feed-Forward Networks (FFNs) in both the forward and backward passes. We also discuss the benefits of combining 2:4 sparsity with fp8 quantization to maximize efficiency gains. This work highlights the potential for sparsity to play a key role in accelerating large language model training and inference.
Track: long paper (up to 4 pages)
Keywords: sparsity, LLMs, machine learning
TL;DR: This paper explores the application of 2:4 sparsity to accelerate LLM training and inference by leveraging intrinsic sparsity in activation functions, rather than dense model weights.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 47
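As a rough illustration of the idea in the abstract, the following is a minimal PyTorch sketch, not the authors' implementation: `squared_relu` and `prune_2_to_4` are hypothetical helper names, and the 2:4 pruning here is a plain reference of the sparsity pattern, without the sparse tensor-core kernels or fp8 quantization that the reported speedups rely on.

```python
# Illustrative sketch (assumed names, not the paper's code): Squared-ReLU
# activations are naturally sparse, so pruning them to a hardware-friendly
# 2:4 pattern (keep the 2 largest-magnitude values in every group of 4)
# discards little or no information.

import torch

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    """Squared-ReLU activation: relu(x)**2, which produces many exact zeros."""
    return torch.relu(x) ** 2

def prune_2_to_4(x: torch.Tensor) -> torch.Tensor:
    """Naive 2:4 pruning along the last dim: keep the 2 largest-magnitude
    entries in each contiguous group of 4, zero the rest. Real speedups need
    the GPU's sparse tensor cores and a compressed storage format; this is
    only a dense reference of the pattern."""
    orig_shape = x.shape
    groups = x.reshape(-1, 4)                          # view as groups of 4
    idx = groups.abs().topk(2, dim=-1).indices         # 2 largest per group
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return (groups * mask).reshape(orig_shape)

# Toy FFN forward pass: the hidden activation is Squared-ReLU'd, then
# 2:4-pruned before the second matmul.
x = torch.randn(8, 1024)
w1, w2 = torch.randn(1024, 4096), torch.randn(4096, 1024)
h = squared_relu(x @ w1)
h_sparse = prune_2_to_4(h)
y = h_sparse @ w2
print(f"activation zero fraction: {(h == 0).float().mean():.2f}")
```

On hardware with 2:4 sparse tensor cores, the pruned activation would instead be compressed and dispatched to a sparse matmul kernel to realize the FFN speedup; the sketch only shows why Squared-ReLU's intrinsic sparsity makes 2:4 pruning of activations nearly lossless.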