Accelerating Transformer Inference and Training with 2:4 Activation Sparsity

Published: 05 Mar 2025, Last Modified: 18 Apr 2025 · SLLM · CC BY 4.0
Track: long paper (up to 4 pages)
Keywords: sparsity, LLMs, machine learning
TL;DR: This paper explores the application of 2:4 sparsity to accelerate LLM training and inference by leveraging intrinsic sparsity in activation functions, rather than dense model weights.
Abstract:

In this paper, we demonstrate how to apply 2:4 sparsity, a hardware-accelerated GPU sparsity pattern, to activations in order to accelerate large language model training and inference. Crucially, we exploit the intrinsic sparsity of Squared-ReLU activations to provide this acceleration with no accuracy loss. Our approach achieves up to 1.3x faster Feed-Forward Networks (FFNs) in both the forward and backward passes. We also discuss the benefits of combining 2:4 sparsity with fp8 quantization to maximize efficiency gains. This work highlights the potential for sparsity to play a key role in accelerating large language model training and inference.
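To make the idea concrete, below is a minimal PyTorch sketch of the technique the abstract describes: apply a Squared-ReLU activation, then enforce the 2:4 structured-sparsity pattern on the resulting activations before the second FFN matmul. This is not the authors' implementation; the helper names `squared_relu` and `prune_2to4` are illustrative, and the dense matmul here is only a stand-in. Actual speedups require 2:4 sparse tensor-core kernels (e.g., NVIDIA Sparse Tensor Cores via cuSPARSELt), which this toy example does not invoke.

```python
import torch

def squared_relu(x):
    # Squared-ReLU activation: relu(x)**2. Its outputs are intrinsically
    # mostly zero, which is what makes 2:4 pruning nearly lossless here.
    return torch.relu(x) ** 2

def prune_2to4(x):
    # Enforce the 2:4 pattern along the last dimension: in every contiguous
    # group of 4 values, keep the 2 largest magnitudes and zero the rest.
    orig_shape = x.shape
    groups = x.reshape(-1, 4)
    topk = groups.abs().topk(2, dim=-1).indices      # 2 largest per group
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return (groups * mask).reshape(orig_shape)

# Toy FFN forward pass with 2:4-sparse activations (illustrative sizes).
d_model, d_ff, batch = 16, 64, 2
w1 = torch.randn(d_model, d_ff)
w2 = torch.randn(d_ff, d_model)
x = torch.randn(batch, d_model)

h = squared_relu(x @ w1)   # intrinsically sparse activations
h_24 = prune_2to4(h)       # hardware-friendly 2:4 structured sparsity
y = h_24 @ w2              # this matmul is where sparse kernels would accelerate
```

In this sketch the second matmul is the one that benefits: because its activation input is 2:4 sparse, it can be dispatched to sparse tensor cores, and an analogous argument applies to the activation-gradient matmuls in the backward pass.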

Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 47