Abstract: In this paper, we demonstrate how to apply 2:4 sparsity, a hardware-accelerated GPU sparsity pattern, to activations to accelerate large language model training and inference. Crucially, we exploit the intrinsic sparsity found in Squared-ReLU activations to provide this acceleration with no accuracy loss. Our approach achieves up to 1.3x faster Feed-Forward Networks (FFNs) in both the forward and backward passes. We also discuss the benefits of combining 2:4 sparsity with fp8 quantization to maximize efficiency gains. This work highlights the potential for sparsity to play a key role in accelerating large language model training and inference.
Track: long paper (up to 4 pages)
Keywords: sparsity, LLMs, machine learning
TL;DR: This paper explores the application of 2:4 sparsity to accelerate LLM training and inference by leveraging intrinsic sparsity in activation functions, rather than dense model weights.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 47
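As a rough illustration of the idea in the abstract, the following is a minimal PyTorch sketch, not the authors' implementation: `squared_relu` and `prune_2_to_4` are hypothetical helper names, and the 2:4 pruning here is a plain reference of the sparsity pattern, without the sparse tensor-core kernels or fp8 quantization that the reported speedups rely on.

```python
# Illustrative sketch (assumed names, not the paper's code): Squared-ReLU
# activations are naturally sparse, so pruning them to a hardware-friendly
# 2:4 pattern (keep the 2 largest-magnitude values in every group of 4)
# discards little or no information.

import torch

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    """Squared-ReLU activation: relu(x)**2, which produces many exact zeros."""
    return torch.relu(x) ** 2

def prune_2_to_4(x: torch.Tensor) -> torch.Tensor:
    """Naive 2:4 pruning along the last dim: keep the 2 largest-magnitude
    entries in each contiguous group of 4, zero the rest. Real speedups need
    the GPU's sparse tensor cores and a compressed storage format; this is
    only a dense reference of the pattern."""
    orig_shape = x.shape
    groups = x.reshape(-1, 4)                          # view as groups of 4
    idx = groups.abs().topk(2, dim=-1).indices         # 2 largest per group
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return (groups * mask).reshape(orig_shape)

# Toy FFN forward pass: the hidden activation is Squared-ReLU'd, then
# 2:4-pruned before the second matmul.
x = torch.randn(8, 1024)
w1, w2 = torch.randn(1024, 4096), torch.randn(4096, 1024)
h = squared_relu(x @ w1)
h_sparse = prune_2_to_4(h)
y = h_sparse @ w2
print(f"activation zero fraction: {(h == 0).float().mean():.2f}")
```

On hardware with 2:4 sparse tensor cores, the pruned activation would instead be compressed and dispatched to a sparse matmul kernel to realize the FFN speedup; the sketch only shows why Squared-ReLU's intrinsic sparsity makes 2:4 pruning of activations nearly lossless.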